Class PDFBoxTree

  • Direct Known Subclasses:
    PDFDomTree

    public abstract class PDFBoxTree
    extends org.apache.pdfbox.text.PDFTextStripper
    A generic tree of boxes created from a PDF file. It processes the PDF document and calls the appropriate abstract methods in order to render a page, text box, etc. The particular implementations are expected to implement these actions in order to build the resulting document tree.
    Author:
    burgetr
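    The callback structure described above can be sketched without PDFBox at all. The following is only an illustration of the pattern, not the real class: BoxTreeSketch, OutlineTree and the process driver are hypothetical names, and the real PDFBoxTree is driven by the PDF content stream rather than by a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the abstract tree builder; NOT the real PDFBoxTree API. */
abstract class BoxTreeSketch {
    /** Called once at the start of every page. */
    protected abstract void startNewPage();

    /** Called once for every finished text box. */
    protected abstract void renderText(String data);

    /** Hypothetical driver: walks pages of text boxes and fires the callbacks. */
    public void process(List<List<String>> pages) {
        for (List<String> page : pages) {
            startNewPage();
            for (String box : page) {
                renderText(box);
            }
        }
    }
}

/** Example implementation that builds an indented outline instead of a DOM. */
class OutlineTree extends BoxTreeSketch {
    final List<String> lines = new ArrayList<>();
    private int pageCount = 0;

    @Override
    protected void startNewPage() {
        pageCount++;
        lines.add("page " + pageCount);
    }

    @Override
    protected void renderText(String data) {
        lines.add("  " + data);
    }
}
```

    A concrete subclass such as PDFDomTree plays the role of OutlineTree here: it decides what a page or a text box becomes in the resulting tree.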
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static String[] cssFontFamily
      Known font names that are recognized in the PDF files
      protected static String[] cssFontStyle
      Font styles corresponding to the font subtypes in pdFontType
      protected static String[] cssFontWeight
      Font weights corresponding to the font subtypes in pdFontType
      protected float cur_x
      Current text coordinates (the coordinates of the last encountered text box).
      protected float cur_y
      Current text coordinates (the coordinates of the last encountered text box).
      protected BoxStyle curstyle
      The style of the text line being created
      protected boolean disableGraphics
      When set to true, the graphics in the PDF file will be ignored.
      protected boolean disableImageData
      When set to true, the image data will not be transferred to the HTML data: URL.
      protected boolean disableImages
      When set to true, the embedded images will be ignored.
      protected int endPage
      Last page to be processed
      protected FontTable fontTable
      Table of embedded fonts
      protected Vector<PathSegment> graphicsPath
      Current graphics path
      protected org.apache.pdfbox.text.TextPosition lastDia
      Last diacritic if any
      protected org.apache.pdfbox.text.TextPosition lastText
      Previous positioned text.
      protected float path_start_x
      Starting path construction position
      protected float path_start_y
      Starting path construction position
      protected float path_x
      Current path construction position
      protected float path_y
      Current path construction position
      protected static String[] pdFontType
      Known font subtypes recognized in PDF files
      protected org.apache.pdfbox.pdmodel.PDPage pdpage
      The PDF page currently being processed
      protected int startPage
      First page to be processed
      protected BoxStyle style
      The style of the future box being modified by the operators
      protected StringBuilder textLine
      The text box currently being created.
      protected TextMetrics textMetrics
      Current text line metrics
      static String UNIT
      Length units used in the generated CSS
      • Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

        document, charactersByArticle, LINE_SEPARATOR, output
    • Constructor Summary

      Constructors 
      Constructor Description
      PDFBoxTree()  
    • Method Summary

      Modifier and Type Method Description
      protected String colorString​(float r, float g, float b)
      Creates a CSS rgb() specification from the color component values.
      protected String colorString​(int ir, int ig, int ib)
      Creates a CSS rgb() specification from the color component values.
      protected String colorString​(org.apache.pdfbox.pdmodel.graphics.color.PDColor pdcolor)
      Creates a CSS rgb() specification from a PDF color.
      protected float computeFontHeight​(org.apache.pdfbox.pdmodel.font.PDFont arg0)  
      protected AffineTransform createCurrentPageTransformation()  
      protected void finishBox()
      Finishes the current box - empties the text line buffer and creates a DOM element from it.
      protected float floatValue​(org.apache.pdfbox.cos.COSBase value)
      Obtains a number from a PDF number value
      protected org.apache.pdfbox.pdmodel.common.PDRectangle getCurrentMediaBox()
      Obtains the media box valid for the current page.
      boolean getDisableGraphics()
      Checks whether the graphics processing is disabled.
      boolean getDisableImageData()
      Checks whether the copying of image data is disabled.
      boolean getDisableImages()
      Checks whether processing of embedded images is disabled.
      int getEndPage()  
      protected float getLength​(org.apache.pdfbox.cos.COSBase value)
      Obtains a length in points from a PDF number value
      int getStartPage()  
      protected byte getTextDirectionality​(String s)  
      protected byte getTextDirectionality​(org.apache.pdfbox.text.TextPosition text)  
      protected String getTitle()  
      protected int intValue​(org.apache.pdfbox.cos.COSBase value)
      Obtains a number from a PDF number value
      protected boolean isReversed​(byte directionality)
      Checks whether the text directionality corresponds to reversed text (a very rough heuristic).
      protected void processImageOperation​(List<org.apache.pdfbox.cos.COSBase> arguments)  
      protected void processOperator​(org.apache.pdfbox.contentstream.operator.Operator operator, List<org.apache.pdfbox.cos.COSBase> arguments)  
      void processPage​(org.apache.pdfbox.pdmodel.PDPage page)  
      protected void processTextPosition​(org.apache.pdfbox.text.TextPosition text)  
      protected abstract void renderImage​(float x, float y, float width, float height, ImageResource data)
      Adds an image to the current page.
      protected abstract void renderPath​(List<PathSegment> path, boolean stroke, boolean fill)
      Adds a graphics path to the current page.
      protected abstract void renderText​(String data, TextMetrics metrics)
      Creates a new text box in the current page.
      void setDisableGraphics​(boolean disableGraphics)
      Disables the processing of the graphic operators in the PDF files.
      void setDisableImageData​(boolean disableImageData)
      Disables the copying of image data to the resulting DOM tree.
      void setDisableImages​(boolean disableImages)
      Disables the processing of images contained in the PDF files.
      void setEndPage​(int endPage)  
      void setStartPage​(int startPage)  
      protected void showGlyph​(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)  
      protected abstract void startNewPage()
      Adds a new page to the resulting document and makes it the current (active) page.
      protected String stringValue​(org.apache.pdfbox.cos.COSBase value)
      Obtains a string from a PDF value
      protected float[] toRectangle​(List<PathSegment> path)  
      protected float transformLength​(float w)
      Transforms a length according to the current transformation matrix.
      protected float[] transformPosition​(float x, float y)
      Transforms a position according to the current transformation matrix and current page transformation.
      protected void updateFontTable()
      Updates the font table by adding new fonts used on the current page.
      protected void updateStyle​(BoxStyle bstyle, org.apache.pdfbox.text.TextPosition text)
      Updates the text style according to a new text position
      • Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

        endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCurrentPageNo, getDropThreshold, getEndBookmark, getCharactersByArticle, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
      • Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

        addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
    • Field Detail

      • cssFontFamily

        protected static String[] cssFontFamily
        Known font names that are recognized in the PDF files
      • pdFontType

        protected static String[] pdFontType
        Known font subtypes recognized in PDF files
      • cssFontWeight

        protected static String[] cssFontWeight
        Font weights corresponding to the font subtypes in pdFontType
      • cssFontStyle

        protected static String[] cssFontStyle
        Font styles corresponding to the font subtypes in pdFontType
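        The three arrays pdFontType, cssFontWeight and cssFontStyle are described as corresponding to each other, which suggests a parallel-array lookup by index. A minimal sketch of that idea follows. The array contents and the helper names (indexOfSubtype, weightFor, styleFor) are invented for illustration, since the real values are not listed on this page.

```java
/** Illustrative parallel-array lookup; the array contents are made up. */
class FontSubtypeSketch {
    // Hypothetical values: the real contents of pdFontType, cssFontWeight
    // and cssFontStyle are not given in this documentation.
    static final String[] PD_FONT_TYPE    = { "normal", "bold", "italic", "bolditalic" };
    static final String[] CSS_FONT_WEIGHT = { "normal", "bold", "normal", "bold" };
    static final String[] CSS_FONT_STYLE  = { "normal", "normal", "italic", "italic" };

    /** Returns the index of the subtype (case-insensitive), or -1 when unknown. */
    static int indexOfSubtype(String subtype) {
        for (int i = 0; i < PD_FONT_TYPE.length; i++) {
            if (PD_FONT_TYPE[i].equalsIgnoreCase(subtype)) {
                return i;
            }
        }
        return -1;
    }

    /** CSS font-weight for a subtype; falls back to "normal" when unknown. */
    static String weightFor(String subtype) {
        int i = indexOfSubtype(subtype);
        return i >= 0 ? CSS_FONT_WEIGHT[i] : "normal";
    }

    /** CSS font-style for a subtype; falls back to "normal" when unknown. */
    static String styleFor(String subtype) {
        int i = indexOfSubtype(subtype);
        return i >= 0 ? CSS_FONT_STYLE[i] : "normal";
    }
}
```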
      • disableGraphics

        protected boolean disableGraphics
        When set to true, the graphics in the PDF file will be ignored.
      • disableImages

        protected boolean disableImages
        When set to true, the embedded images will be ignored.
      • disableImageData

        protected boolean disableImageData
        When set to true, the image data will not be transferred to the HTML data: URL.
      • startPage

        protected int startPage
        First page to be processed
      • endPage

        protected int endPage
        Last page to be processed
      • fontTable

        protected FontTable fontTable
        Table of embedded fonts
      • pdpage

        protected org.apache.pdfbox.pdmodel.PDPage pdpage
        The PDF page currently being processed
      • cur_x

        protected float cur_x
        Current text coordinates (the coordinates of the last encountered text box).
      • cur_y

        protected float cur_y
        Current text coordinates (the coordinates of the last encountered text box).
      • path_x

        protected float path_x
        Current path construction position
      • path_y

        protected float path_y
        Current path construction position
      • path_start_x

        protected float path_start_x
        Starting path construction position
      • path_start_y

        protected float path_start_y
        Starting path construction position
      • lastText

        protected org.apache.pdfbox.text.TextPosition lastText
        Previous positioned text.
      • lastDia

        protected org.apache.pdfbox.text.TextPosition lastDia
        Last diacritic if any
      • textLine

        protected StringBuilder textLine
        The text box currently being created.
      • textMetrics

        protected TextMetrics textMetrics
        Current text line metrics
      • style

        protected BoxStyle style
        The style of the future box being modified by the operators
      • curstyle

        protected BoxStyle curstyle
        The style of the text line being created
    • Method Detail

      • processPage

        public void processPage​(org.apache.pdfbox.pdmodel.PDPage page)
                         throws IOException
        Overrides:
        processPage in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • getDisableGraphics

        public boolean getDisableGraphics()
        Checks whether the graphics processing is disabled.
        Returns:
        true when the graphics processing is disabled in the parser configuration.
      • setDisableGraphics

        public void setDisableGraphics​(boolean disableGraphics)
        Disables the processing of the graphic operators in the PDF files.
        Parameters:
        disableGraphics - when set to true, the graphics in the source file are ignored.
      • getDisableImages

        public boolean getDisableImages()
        Checks whether processing of embedded images is disabled.
        Returns:
        true when the processing of embedded images is disabled in the parser configuration.
      • setDisableImages

        public void setDisableImages​(boolean disableImages)
        Disables the processing of images contained in the PDF files.
        Parameters:
        disableImages - when set to true, the images in the source file are ignored.
      • getDisableImageData

        public boolean getDisableImageData()
        Checks whether the copying of image data is disabled.
        Returns:
        true when the copying of image data is disabled in the parser configuration.
      • setDisableImageData

        public void setDisableImageData​(boolean disableImageData)
        Disables the copying of image data to the resulting DOM tree.
        Parameters:
        disableImageData - when set to true, the image data is not copied to the document tree. The resulting img elements will have an empty src attribute.
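        The effect of this flag on an img element can be illustrated with a small helper that builds a data: URL. This is a sketch under assumptions: ImageSrcSketch and imageSrc are hypothetical names, and the Base64 encoding with a MIME type prefix is the usual data: URL form rather than a confirmed detail of this library.

```java
import java.util.Base64;

/** Hypothetical helper showing how a disableImageData flag could affect img src. */
class ImageSrcSketch {
    /**
     * Builds the value of the src attribute.
     * Returns an empty string when image data copying is disabled,
     * otherwise a Base64-encoded data: URL.
     */
    static String imageSrc(byte[] imageData, String mimeType, boolean disableImageData) {
        if (disableImageData) {
            return "";
        }
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageData);
    }
}
```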
      • getStartPage

        public int getStartPage()
        Overrides:
        getStartPage in class org.apache.pdfbox.text.PDFTextStripper
      • setStartPage

        public void setStartPage​(int startPage)
        Overrides:
        setStartPage in class org.apache.pdfbox.text.PDFTextStripper
      • getEndPage

        public int getEndPage()
        Overrides:
        getEndPage in class org.apache.pdfbox.text.PDFTextStripper
      • setEndPage

        public void setEndPage​(int endPage)
        Overrides:
        setEndPage in class org.apache.pdfbox.text.PDFTextStripper
      • startNewPage

        protected abstract void startNewPage()
        Adds a new page to the resulting document and makes it the current (active) page.
      • renderText

        protected abstract void renderText​(String data,
                                           TextMetrics metrics)
        Creates a new text box in the current page. The style and position of the text are contained in the curstyle property.
        Parameters:
        data - the text contents
        metrics - the metrics of the text line
      • renderPath

        protected abstract void renderPath​(List<PathSegment> path,
                                           boolean stroke,
                                           boolean fill)
                                    throws IOException
        Adds a graphics path to the current page.
        Parameters:
        path - the path segments to be rendered
        stroke - should the path be stroked?
        fill - should the path be filled?
        Throws:
        IOException
      • renderImage

        protected abstract void renderImage​(float x,
                                            float y,
                                            float width,
                                            float height,
                                            ImageResource data)
                                     throws IOException
        Adds an image to the current page.
        Parameters:
        x - the X coordinate of the image
        y - the Y coordinate of the image
        width - the width of the image
        height - the height of the image
        data - the image resource that holds the image data
        Throws:
        IOException
      • updateFontTable

        protected void updateFontTable()
        Updates the font table by adding new fonts used on the current page.
      • processOperator

        protected void processOperator​(org.apache.pdfbox.contentstream.operator.Operator operator,
                                       List<org.apache.pdfbox.cos.COSBase> arguments)
                                throws IOException
        Overrides:
        processOperator in class org.apache.pdfbox.contentstream.PDFStreamEngine
        Throws:
        IOException
      • processImageOperation

        protected void processImageOperation​(List<org.apache.pdfbox.cos.COSBase> arguments)
                                      throws IOException
        Throws:
        IOException
      • processTextPosition

        protected void processTextPosition​(org.apache.pdfbox.text.TextPosition text)
        Overrides:
        processTextPosition in class org.apache.pdfbox.text.PDFTextStripper
      • finishBox

        protected void finishBox()
        Finishes the current box - empties the text line buffer and creates a DOM element from it.
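        The flush behaviour described here can be sketched with a plain StringBuilder. In this illustration the list of strings stands in for the DOM elements, the append method and the class name are hypothetical, and skipping an empty buffer is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the text line buffer; a list of strings replaces the DOM. */
class TextLineSketch {
    private final StringBuilder textLine = new StringBuilder();
    final List<String> boxes = new ArrayList<>();

    /** Hypothetical: collects text that belongs to the box being built. */
    void append(String s) {
        textLine.append(s);
    }

    /** Emits the buffered text as one box and empties the buffer. */
    void finishBox() {
        if (textLine.length() > 0) {       // assumption: empty boxes are skipped
            boxes.add(textLine.toString());
            textLine.setLength(0);         // reuse the same buffer for the next box
        }
    }
}
```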
      • isReversed

        protected boolean isReversed​(byte directionality)
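        Checks whether the text directionality corresponds to reversed text (a very rough heuristic).
        Parameters:
        directionality - the directionality value as defined by java.lang.Character
        Returns:
        true when the directionality corresponds to reversed text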
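        One plausible way to implement such a check is to treat the right-to-left classes of java.lang.Character as reversed. This is a sketch only: which directionality values the real method considers reversed is not stated on this page, so the set below is an assumption.

```java
/** Sketch of a rough right-to-left test based on java.lang.Character constants. */
class ReversedSketch {
    /** Assumption: the four right-to-left classes count as reversed text. */
    static boolean isReversed(byte directionality) {
        switch (directionality) {
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
                return true;
            default:
                return false;
        }
    }
}
```

        For example, the value returned by Character.getDirectionality for a Hebrew letter falls into the first case, while a Latin letter falls through to the default.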
      • updateStyle

        protected void updateStyle​(BoxStyle bstyle,
                                   org.apache.pdfbox.text.TextPosition text)
        Updates the text style according to a new text position
        Parameters:
        bstyle - the style to be updated
        text - the text position
      • getCurrentMediaBox

        protected org.apache.pdfbox.pdmodel.common.PDRectangle getCurrentMediaBox()
        Obtains the media box valid for the current page.
        Returns:
        the media box rectangle
      • transformLength

        protected float transformLength​(float w)
        Transforms a length according to the current transformation matrix.
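        As an illustration of transforming a length, the sketch below uses java.awt.geom.AffineTransform as a stand-in for the current transformation matrix and measures how a horizontal vector of length w is stretched. The explicit ctm parameter and the choice of a horizontal vector are assumptions; the real method reads the matrix from the graphics state.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Sketch: scale a length by a transformation matrix, ignoring translation. */
class LengthSketch {
    /**
     * Transforms the vector (w, 0) with the linear part of the matrix
     * and returns the length of the result.
     */
    static float transformLength(AffineTransform ctm, float w) {
        Point2D d = ctm.deltaTransform(new Point2D.Float(w, 0f), null);
        return (float) Math.hypot(d.getX(), d.getY());
    }
}
```

        Using deltaTransform rather than transform keeps the translation component of the matrix from leaking into the result.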
      • transformPosition

        protected float[] transformPosition​(float x,
                                            float y)
        Transforms a position according to the current transformation matrix and current page transformation.
        Parameters:
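        x - the X coordinate of the position
        y - the Y coordinate of the position
        Returns:
        the transformed position as an array of coordinates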
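        The two-stage transformation can be sketched with java.awt.geom.AffineTransform. Everything specific here is an assumption made for illustration: the page transformation is modelled as a flip of the Y axis (PDF coordinates start at the bottom-left corner, HTML coordinates at the top-left), and both matrices are passed in explicitly instead of being read from the processing state.

```java
import java.awt.geom.AffineTransform;

/** Sketch: apply a current matrix first, then a page-level transformation. */
class PositionSketch {
    /** Hypothetical page transformation that flips Y for a page of the given height. */
    static AffineTransform pageTransformation(float pageHeight) {
        AffineTransform at = new AffineTransform();
        at.translate(0, pageHeight);
        at.scale(1, -1);
        return at;
    }

    /** Returns {x', y'} after applying ctm and then the page transformation. */
    static float[] transformPosition(AffineTransform ctm, AffineTransform page,
                                     float x, float y) {
        AffineTransform total = new AffineTransform(page);
        total.concatenate(ctm);            // ctm is applied to the point first
        float[] pt = { x, y };
        total.transform(pt, 0, pt, 0, 1);  // in-place transform of one point
        return pt;
    }
}
```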
      • createCurrentPageTransformation

        protected AffineTransform createCurrentPageTransformation()
      • intValue

        protected int intValue​(org.apache.pdfbox.cos.COSBase value)
        Obtains a number from a PDF number value
        Parameters:
        value - the PDF value of the Integer or Float type
        Returns:
        the corresponding numeric value
      • floatValue

        protected float floatValue​(org.apache.pdfbox.cos.COSBase value)
        Obtains a number from a PDF number value
        Parameters:
        value - the PDF value of the Integer or Float type
        Returns:
        the corresponding numeric value
      • getLength

        protected float getLength​(org.apache.pdfbox.cos.COSBase value)
        Obtains a length in points from a PDF number value
        Parameters:
        value - the PDF value of the Integer or Float type
        Returns:
        the resulting length in points
      • stringValue

        protected String stringValue​(org.apache.pdfbox.cos.COSBase value)
        Obtains a string from a PDF value
        Parameters:
        value - the PDF value of the String, Integer or Float type
        Returns:
        the corresponding string value
      • colorString

        protected String colorString​(int ir,
                                     int ig,
                                     int ib)
        Creates a CSS rgb() specification from the color component values.
        Parameters:
        ir - red value (0..255)
        ig - green value (0..255)
        ib - blue value (0..255)
        Returns:
        the rgb() string
      • colorString

        protected String colorString​(float r,
                                     float g,
                                     float b)
        Creates a CSS rgb() specification from the color component values.
        Parameters:
        r - red value (0..1)
        g - green value (0..1)
        b - blue value (0..1)
        Returns:
        the rgb() string
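        The relation between the 0..1 and 0..255 variants can be sketched as follows. The exact output format (no spaces after the commas) and the rounding of the float components are assumptions; this page only states that an rgb() string is returned.

```java
/** Sketch of converting color components to a CSS rgb() string. */
class ColorSketch {
    /** Components in the range 0..255. */
    static String colorString(int ir, int ig, int ib) {
        return "rgb(" + ir + "," + ig + "," + ib + ")";
    }

    /** Components in the range 0..1; scaled to 0..255 and rounded (assumption). */
    static String colorString(float r, float g, float b) {
        return colorString(Math.round(r * 255), Math.round(g * 255), Math.round(b * 255));
    }
}
```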
      • colorString

        protected String colorString​(org.apache.pdfbox.pdmodel.graphics.color.PDColor pdcolor)
        Creates a CSS rgb() specification from a PDF color.
        Parameters:
        pdcolor - the PDF color to be converted
        Returns:
        the rgb() string
      • getTitle

        protected String getTitle()
      • getTextDirectionality

        protected byte getTextDirectionality​(org.apache.pdfbox.text.TextPosition text)
      • getTextDirectionality

        protected byte getTextDirectionality​(String s)
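        This method is undocumented, so the following is only one plausible reading of its name: return the directionality of the first strongly directional character in the string. The strategy and the fallback to DIRECTIONALITY_UNDEFINED are assumptions, not confirmed behaviour.

```java
/** Sketch: directionality of the first strong (L, R or AL) character. */
class DirectionalitySketch {
    static byte getTextDirectionality(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return dir;
            }
            i += Character.charCount(cp);  // step over surrogate pairs correctly
        }
        // No strong character found (digits, spaces, punctuation only).
        return Character.DIRECTIONALITY_UNDEFINED;
    }
}
```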
      • showGlyph

        protected void showGlyph​(org.apache.pdfbox.util.Matrix arg0,
                                 org.apache.pdfbox.pdmodel.font.PDFont arg1,
                                 int arg2,
                                 String arg3,
                                 org.apache.pdfbox.util.Vector arg4)
                          throws IOException
        Overrides:
        showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
        Throws:
        IOException
      • computeFontHeight

        protected float computeFontHeight​(org.apache.pdfbox.pdmodel.font.PDFont arg0)
                                   throws IOException
        Throws:
        IOException