Package org.fit.pdfdom
Class PDFBoxTree
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.PDFTextStripper
      - org.fit.pdfdom.PDFBoxTree
- Direct Known Subclasses:
PDFDomTree
public abstract class PDFBoxTree extends org.apache.pdfbox.text.PDFTextStripper
A generic tree of boxes created from a PDF file. It processes the PDF document and calls the appropriate abstract methods in order to render a page, a text box, etc. The particular implementations are expected to implement these actions in order to build the resulting document tree.
- Author:
- burgetr
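The callback pattern from the class description can be sketched with a small stand-alone example. This is only an illustration under stated assumptions: BoxTreeSketch, LoggingBoxTree and the process method are hypothetical names that do not exist in the library, and the real class takes its input from PDFBox rather than from a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the callback pattern; NOT the real PDFBoxTree. */
abstract class BoxTreeSketch {
    /** Called once for every page before its content is rendered. */
    protected abstract void startNewPage();

    /** Called once for every text box found on the current page. */
    protected abstract void renderText(String data);

    /** Walks the (hypothetical) input and dispatches to the abstract callbacks. */
    public final void process(List<List<String>> pages) {
        for (List<String> page : pages) {
            startNewPage();
            for (String box : page) {
                renderText(box);
            }
        }
    }
}

/** Concrete implementation that records the callbacks as plain strings. */
final class LoggingBoxTree extends BoxTreeSketch {
    final List<String> events = new ArrayList<>();

    @Override
    protected void startNewPage() {
        events.add("page");
    }

    @Override
    protected void renderText(String data) {
        events.add("text:" + data);
    }

    public static void main(String[] args) {
        LoggingBoxTree tree = new LoggingBoxTree();
        tree.process(Arrays.asList(
                Arrays.asList("Hello", "world"),
                Arrays.asList("Second page")));
        System.out.println(tree.events);
    }
}
```

A real subclass would build document nodes in these callbacks instead of recording strings; PDFDomTree is the known subclass listed above.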
-
-
Field Summary
Fields
| Modifier and Type | Field | Description |
|---|---|---|
| protected static String[] | cssFontFamily | Known font names that are recognized in the PDF files |
| protected static String[] | cssFontStyle | Font styles corresponding to the font subtypes in pdFontType |
| protected static String[] | cssFontWeight | Font weights corresponding to the font subtypes in pdFontType |
| protected float | cur_x | Current text coordinates (the coordinates of the last encountered text box). |
| protected float | cur_y | Current text coordinates (the coordinates of the last encountered text box). |
| protected BoxStyle | curstyle | The style of the text line being created |
| protected boolean | disableGraphics | When set to true, the graphics in the PDF file will be ignored. |
| protected boolean | disableImageData | When set to true, the image data will not be transferred to the HTML data: URL. |
| protected boolean | disableImages | When set to true, the embedded images will be ignored. |
| protected int | endPage | Last page to be processed |
| protected FontTable | fontTable | Table of embedded fonts |
| protected Vector<PathSegment> | graphicsPath | Current graphics path |
| protected org.apache.pdfbox.text.TextPosition | lastDia | Last diacritic if any |
| protected org.apache.pdfbox.text.TextPosition | lastText | Previous positioned text. |
| protected float | path_start_x | Starting path construction position |
| protected float | path_start_y | Starting path construction position |
| protected float | path_x | Current path construction position |
| protected float | path_y | Current path construction position |
| protected static String[] | pdFontType | Known font subtypes recognized in PDF files |
| protected org.apache.pdfbox.pdmodel.PDPage | pdpage | The PDF page currently being processed |
| protected int | startPage | First page to be processed |
| protected BoxStyle | style | The style of the future box being modified by the operators |
| protected StringBuilder | textLine | The text box currently being created. |
| protected TextMetrics | textMetrics | Current text line metrics |
| static String | UNIT | Length units used in the generated CSS |
-
Constructor Summary
Constructors
| Constructor | Description |
|---|---|
| PDFBoxTree() | |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| protected String | colorString(float r, float g, float b) | Creates a CSS rgb() specification from the color component values. |
| protected String | colorString(int ir, int ig, int ib) | Creates a CSS rgb() specification from the color component values. |
| protected String | colorString(org.apache.pdfbox.pdmodel.graphics.color.PDColor pdcolor) | Creates a CSS rgb() specification from a PDF color |
| protected float | computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) | |
| protected AffineTransform | createCurrentPageTransformation() | |
| protected void | finishBox() | Finishes the current box - empties the text line buffer and creates a DOM element from it. |
| protected float | floatValue(org.apache.pdfbox.cos.COSBase value) | Obtains a number from a PDF number value |
| protected org.apache.pdfbox.pdmodel.common.PDRectangle | getCurrentMediaBox() | Obtains the media box valid for the current page. |
| boolean | getDisableGraphics() | Checks whether the graphics processing is disabled. |
| boolean | getDisableImageData() | Checks whether the copying of image data is disabled. |
| boolean | getDisableImages() | Checks whether processing of embedded images is disabled. |
| int | getEndPage() | |
| protected float | getLength(org.apache.pdfbox.cos.COSBase value) | Obtains a length in points from a PDF number value |
| int | getStartPage() | |
| protected byte | getTextDirectionality(String s) | |
| protected byte | getTextDirectionality(org.apache.pdfbox.text.TextPosition text) | |
| protected String | getTitle() | |
| protected int | intValue(org.apache.pdfbox.cos.COSBase value) | Obtains a number from a PDF number value |
| protected boolean | isReversed(byte directionality) | Checks whether the text directionality corresponds to reversed text (very rough) |
| protected void | processImageOperation(List<org.apache.pdfbox.cos.COSBase> arguments) | |
| protected void | processOperator(org.apache.pdfbox.contentstream.operator.Operator operator, List<org.apache.pdfbox.cos.COSBase> arguments) | |
| void | processPage(org.apache.pdfbox.pdmodel.PDPage page) | |
| protected void | processTextPosition(org.apache.pdfbox.text.TextPosition text) | |
| protected abstract void | renderImage(float x, float y, float width, float height, ImageResource data) | Adds an image to the current page. |
| protected abstract void | renderPath(List<PathSegment> path, boolean stroke, boolean fill) | Adds a graphics path (for example, a rectangle) to the current page. |
| protected abstract void | renderText(String data, TextMetrics metrics) | Creates a new text box in the current page. |
| void | setDisableGraphics(boolean disableGraphics) | Disables the processing of the graphic operators in the PDF files. |
| void | setDisableImageData(boolean disableImageData) | Disables copying of the image data to the resulting DOM tree. |
| void | setDisableImages(boolean disableImages) | Disables the processing of images contained in the PDF files. |
| void | setEndPage(int endPage) | |
| void | setStartPage(int startPage) | |
| protected void | showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) | |
| protected abstract void | startNewPage() | Adds a new page to the resulting document and makes it a current (active) page. |
| protected String | stringValue(org.apache.pdfbox.cos.COSBase value) | Obtains a string from a PDF value |
| protected float[] | toRectangle(List<PathSegment> path) | |
| protected float | transformLength(float w) | Transforms a length according to the current transformation matrix. |
| protected float[] | transformPosition(float x, float y) | Transforms a position according to the current transformation matrix and current page transformation. |
| protected void | updateFontTable() | Updates the font table by adding new fonts used at the current page. |
| protected void | updateStyle(BoxStyle bstyle, org.apache.pdfbox.text.TextPosition text) | Updates the text style according to a new text position |
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCurrentPageNo, getDropThreshold, getEndBookmark, getCharactersByArticle, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
-
-
-
Field Detail
-
UNIT
public static final String UNIT
Length units used in the generated CSS
- See Also:
Constant Field Values
-
cssFontFamily
protected static String[] cssFontFamily
Known font names that are recognized in the PDF files
-
pdFontType
protected static String[] pdFontType
Known font subtypes recognized in PDF files
-
cssFontWeight
protected static String[] cssFontWeight
Font weights corresponding to the font subtypes in pdFontType
-
cssFontStyle
protected static String[] cssFontStyle
Font styles corresponding to the font subtypes in pdFontType
-
disableGraphics
protected boolean disableGraphics
When set to true, the graphics in the PDF file will be ignored.
-
disableImages
protected boolean disableImages
When set to true, the embedded images will be ignored.
-
disableImageData
protected boolean disableImageData
When set to true, the image data will not be transferred to the HTML data: URL.
-
startPage
protected int startPage
First page to be processed
-
endPage
protected int endPage
Last page to be processed
-
fontTable
protected FontTable fontTable
Table of embedded fonts
-
pdpage
protected org.apache.pdfbox.pdmodel.PDPage pdpage
The PDF page currently being processed
-
cur_x
protected float cur_x
Current text coordinates (the coordinates of the last encountered text box).
-
cur_y
protected float cur_y
Current text coordinates (the coordinates of the last encountered text box).
-
path_x
protected float path_x
Current path construction position
-
path_y
protected float path_y
Current path construction position
-
path_start_x
protected float path_start_x
Starting path construction position
-
path_start_y
protected float path_start_y
Starting path construction position
-
lastText
protected org.apache.pdfbox.text.TextPosition lastText
Previous positioned text.
-
lastDia
protected org.apache.pdfbox.text.TextPosition lastDia
Last diacritic if any
-
textLine
protected StringBuilder textLine
The text box currently being created.
-
textMetrics
protected TextMetrics textMetrics
Current text line metrics
-
graphicsPath
protected Vector<PathSegment> graphicsPath
Current graphics path
-
style
protected BoxStyle style
The style of the future box being modified by the operators
-
curstyle
protected BoxStyle curstyle
The style of the text line being created
-
-
Constructor Detail
-
PDFBoxTree
public PDFBoxTree() throws IOException
- Throws:
IOException
-
-
Method Detail
-
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getDisableGraphics
public boolean getDisableGraphics()
Checks whether the graphics processing is disabled.
- Returns:
true when the graphics processing is disabled in the parser configuration.
-
setDisableGraphics
public void setDisableGraphics(boolean disableGraphics)
Disables the processing of the graphic operators in the PDF files.
- Parameters:
disableGraphics - when set to true, the graphics in the source file are ignored.
-
getDisableImages
public boolean getDisableImages()
Checks whether processing of embedded images is disabled.
- Returns:
true when the processing of embedded images is disabled in the parser configuration.
-
setDisableImages
public void setDisableImages(boolean disableImages)
Disables the processing of images contained in the PDF files.
- Parameters:
disableImages - when set to true, the images in the source file are ignored.
-
getDisableImageData
public boolean getDisableImageData()
Checks whether the copying of image data is disabled.
- Returns:
true when the copying of image data is disabled in the parser configuration.
-
setDisableImageData
public void setDisableImageData(boolean disableImageData)
Disables copying of the image data to the resulting DOM tree.
- Parameters:
disableImageData - when set to true, the image data is not copied to the document tree. The resulting img elements will have an empty src attribute.
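The effect of this flag on an img element's src attribute can be illustrated with a minimal sketch. It assumes the image bytes are embedded as a base64 data: URL; the class ImageSrcSketch, the imageSrc helper and its mimeType parameter are hypothetical and are not part of the library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; it only illustrates the documented effect of the flag. */
final class ImageSrcSketch {
    /**
     * Builds the value of an img src attribute.
     * Returns an empty string when copying of image data is disabled,
     * otherwise a base64-encoded data: URL (an assumed encoding).
     */
    static String imageSrc(byte[] data, String mimeType, boolean disableImageData) {
        if (disableImageData) {
            return "";
        }
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real caller would pass encoded image data.
        byte[] bytes = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(imageSrc(bytes, "image/png", false));
        System.out.println(imageSrc(bytes, "image/png", true).isEmpty());
    }
}
```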
-
getStartPage
public int getStartPage()
- Overrides:
getStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
setStartPage
public void setStartPage(int startPage)
- Overrides:
setStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
getEndPage
public int getEndPage()
- Overrides:
getEndPage in class org.apache.pdfbox.text.PDFTextStripper
-
setEndPage
public void setEndPage(int endPage)
- Overrides:
setEndPage in class org.apache.pdfbox.text.PDFTextStripper
-
startNewPage
protected abstract void startNewPage()
Adds a new page to the resulting document and makes it a current (active) page.
-
renderText
protected abstract void renderText(String data, TextMetrics metrics)
Creates a new text box in the current page. The style and position of the text are contained in the curstyle property.
- Parameters:
data - the text contents.
-
renderPath
protected abstract void renderPath(List<PathSegment> path, boolean stroke, boolean fill) throws IOException
Adds a graphics path (for example, a rectangle) to the current page.
- Parameters:
path - the path segments to be rendered
stroke - whether the path should be stroked
fill - whether the path should be filled
- Throws:
IOException
-
renderImage
protected abstract void renderImage(float x, float y, float width, float height, ImageResource data) throws IOException
Adds an image to the current page.
- Parameters:
x - the X coordinate of the image
y - the Y coordinate of the image
width - the width of the image
height - the height of the image
data - the image data
- Throws:
IOException
-
toRectangle
protected float[] toRectangle(List<PathSegment> path)
-
updateFontTable
protected void updateFontTable()
Updates the font table by adding new fonts used at the current page.
-
processOperator
protected void processOperator(org.apache.pdfbox.contentstream.operator.Operator operator, List<org.apache.pdfbox.cos.COSBase> arguments) throws IOException
- Overrides:
processOperator in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
processImageOperation
protected void processImageOperation(List<org.apache.pdfbox.cos.COSBase> arguments) throws IOException
- Throws:
IOException
-
processTextPosition
protected void processTextPosition(org.apache.pdfbox.text.TextPosition text)
- Overrides:
processTextPosition in class org.apache.pdfbox.text.PDFTextStripper
-
finishBox
protected void finishBox()
Finishes the current box - empties the text line buffer and creates a DOM element from it.
-
isReversed
protected boolean isReversed(byte directionality)
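Checks whether the text directionality corresponds to reversed text (very rough).
- Parameters:
directionality - the directionality value, as returned by Character.getDirectionality
- Returns:
true when the directionality corresponds to reversed text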
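One plausible reading of this check is sketched below. The rule used here (treating the right-to-left directionality classes of java.lang.Character as reversed) is an assumption, not the documented behaviour of this method, and DirectionalitySketch is a hypothetical class name.

```java
/** Hypothetical sketch of a rough "reversed text" check. */
final class DirectionalitySketch {
    /** Assumed rule: right-to-left directionality classes count as reversed. */
    static boolean isReversed(byte directionality) {
        switch (directionality) {
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Latin letter, then the Hebrew letter alef (U+05D0).
        System.out.println(isReversed(Character.getDirectionality('a')));
        System.out.println(isReversed(Character.getDirectionality('\u05D0')));
    }
}
```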
-
updateStyle
protected void updateStyle(BoxStyle bstyle, org.apache.pdfbox.text.TextPosition text)
Updates the text style according to a new text position
- Parameters:
bstyle - the style to be updated
text - the text position
-
getCurrentMediaBox
protected org.apache.pdfbox.pdmodel.common.PDRectangle getCurrentMediaBox()
Obtains the media box valid for the current page.
- Returns:
the media box rectangle
-
transformLength
protected float transformLength(float w)
Transforms a length according to the current transformation matrix.
-
transformPosition
protected float[] transformPosition(float x, float y)
Transforms a position according to the current transformation matrix and current page transformation.
- Parameters:
x - the X coordinate of the position
y - the Y coordinate of the position
- Returns:
the transformed position
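As a rough illustration of transforming a position, the sketch below applies a java.awt.geom.AffineTransform to a point and returns the result as a two-element array. It does not reproduce how this class combines the current transformation matrix with the page transformation; TransformSketch and the Y-flipping example transform are assumptions chosen for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical sketch: applies one affine transformation to a position. */
final class TransformSketch {
    /** Applies the transformation to (x, y) and returns {x', y'}. */
    static float[] transformPosition(AffineTransform at, float x, float y) {
        Point2D.Float src = new Point2D.Float(x, y);
        Point2D.Float dst = new Point2D.Float();
        at.transform(src, dst);
        return new float[] { dst.x, dst.y };
    }

    public static void main(String[] args) {
        // Assumed page transformation: flip the Y axis of a page 100 units high.
        AffineTransform flip = new AffineTransform(1.0, 0.0, 0.0, -1.0, 0.0, 100.0);
        float[] p = transformPosition(flip, 10f, 20f);
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

With the flip above, a point 20 units above the bottom edge ends up 80 units below the top edge, which is the usual conversion between PDF and screen-style coordinates.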
-
createCurrentPageTransformation
protected AffineTransform createCurrentPageTransformation()
-
intValue
protected int intValue(org.apache.pdfbox.cos.COSBase value)
Obtains a number from a PDF number value
- Parameters:
value - the PDF value of the Integer or Float type
- Returns:
the corresponding numeric value
-
floatValue
protected float floatValue(org.apache.pdfbox.cos.COSBase value)
Obtains a number from a PDF number value
- Parameters:
value - the PDF value of the Integer or Float type
- Returns:
the corresponding numeric value
-
getLength
protected float getLength(org.apache.pdfbox.cos.COSBase value)
Obtains a length in points from a PDF number value
- Parameters:
value - the PDF value of the Integer or Float type
- Returns:
the resulting length in points
-
stringValue
protected String stringValue(org.apache.pdfbox.cos.COSBase value)
Obtains a string from a PDF value
- Parameters:
value - the PDF value of the String, Integer or Float type
- Returns:
the corresponding string value
-
colorString
protected String colorString(int ir, int ig, int ib)
Creates a CSS rgb() specification from the color component values.
- Parameters:
ir - red value (0..255)
ig - green value (0..255)
ib - blue value (0..255)
- Returns:
the rgb() string
-
colorString
protected String colorString(float r, float g, float b)
Creates a CSS rgb() specification from the color component values.
- Parameters:
r - red value (0..1)
g - green value (0..1)
b - blue value (0..1)
- Returns:
the rgb() string
-
colorString
protected String colorString(org.apache.pdfbox.pdmodel.graphics.color.PDColor pdcolor)
Creates a CSS rgb() specification from a PDF color
- Parameters:
pdcolor - the PDF color
- Returns:
the rgb() string
-
getTitle
protected String getTitle()
-
getTextDirectionality
protected byte getTextDirectionality(org.apache.pdfbox.text.TextPosition text)
-
getTextDirectionality
protected byte getTextDirectionality(String s)
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
- Overrides:
showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
IOException
-
-