Package org.fit.pdfdom
Class PDFDomTree
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.PDFTextStripper
      - org.fit.pdfdom.PDFBoxTree
        - org.fit.pdfdom.PDFDomTree
-
public class PDFDomTree extends PDFBoxTree
A DOM representation of a PDF file.
- Author:
burgetr
-
Nested Class Summary
- protected class PDFDomTree.HtmlDivLine: Maps an input line to an HTML div rectangle, since HTML does not support standard lines.
-
Field Summary
- protected Element body: The body element of the resulting document.
- protected PDFDomTreeConfig config
- protected Element curpage: The element representing the page currently being created in the resulting document.
- protected String defaultStyle: Default style placed at the beginning of the resulting document.
- protected Document doc: The resulting document representing the PDF file.
- protected Element globalStyle: The global style element of the resulting document.
- protected Element head: The head element of the resulting document.
- protected int pagecnt: Page counter for assigning IDs to the pages.
- protected int textcnt: Text element counter for assigning IDs to the text elements.
- protected Element title: The title element of the resulting document.
-
Fields inherited from class org.fit.pdfdom.PDFBoxTree
cssFontFamily, cssFontStyle, cssFontWeight, cur_x, cur_y, curstyle, disableGraphics, disableImageData, disableImages, endPage, fontTable, graphicsPath, lastDia, lastText, path_start_x, path_start_y, path_x, path_y, pdFontType, pdpage, startPage, style, textLine, textMetrics, UNIT
-
-
Constructor Summary
- PDFDomTree(): Creates a new PDF DOM parser.
- PDFDomTree(PDFDomTreeConfig config): Creates a new PDF DOM parser.
-
Method Summary
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- protected void createDocument(): Creates a new empty HTML document tree.
- Document createDOM(org.apache.pdfbox.pdmodel.PDDocument doc): Loads a PDF document and creates a DOM tree from it.
- protected String createFontFaces()
- protected String createGlobalStyle(): Generates the global CSS style for the whole document.
- protected Element createImageElement(float x, float y, float width, float height, ImageResource resource): Creates an element that represents an image drawn at the specified coordinates in the page.
- protected Element createLineElement(float x1, float y1, float x2, float y2): Creates an element that represents a horizontal or vertical line.
- protected Element createPageElement(): Creates an element that represents a single page.
- protected Element createPathImage(List<PathSegment> path)
- protected Element createRectangleElement(float x, float y, float width, float height, boolean stroke, boolean fill): Creates an element that represents a rectangle drawn at the specified coordinates in the page.
- protected Element createTextElement(float width): Creates an element that represents a single positioned box with no content.
- protected Element createTextElement(String data, float width): Creates an element that represents a single positioned box containing the specified text string.
- protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument document)
- Document getDocument(): Obtains the resulting document tree.
- protected void renderImage(float x, float y, float width, float height, ImageResource resource): Adds an image to the current page.
- protected void renderPath(List<PathSegment> path, boolean stroke, boolean fill): Adds a rectangle to the current page on the specified position.
- protected void renderText(String data, TextMetrics metrics): Creates a new text box in the current page.
- protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
- void startDocument(org.apache.pdfbox.pdmodel.PDDocument document)
- protected void startNewPage(): Adds a new page to the resulting document and makes it the current (active) page.
- protected void updateFontTable(): Updates the font table by adding new fonts used at the current page.
- void writeText(org.apache.pdfbox.pdmodel.PDDocument doc, Writer outputStream): Parses a PDF document and serializes the resulting DOM tree to an output.
-
Methods inherited from class org.fit.pdfdom.PDFBoxTree
colorString, colorString, colorString, createCurrentPageTransformation, finishBox, floatValue, getCurrentMediaBox, getDisableGraphics, getDisableImageData, getDisableImages, getEndPage, getLength, getStartPage, getTextDirectionality, getTextDirectionality, getTitle, intValue, isReversed, processImageOperation, processOperator, processPage, processTextPosition, setDisableGraphics, setDisableImageData, setDisableImages, setEndPage, setStartPage, stringValue, toRectangle, transformLength, transformPosition, updateStyle
-
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCurrentPageNo, getDropThreshold, getEndBookmark, getCharactersByArticle, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Detail
-
defaultStyle
protected String defaultStyle
Default style placed at the beginning of the resulting document.
-
doc
protected Document doc
The resulting document representing the PDF file.
-
head
protected Element head
The head element of the resulting document.
-
body
protected Element body
The body element of the resulting document.
-
title
protected Element title
The title element of the resulting document.
-
globalStyle
protected Element globalStyle
The global style element of the resulting document.
-
curpage
protected Element curpage
The element representing the page currently being created in the resulting document.
-
textcnt
protected int textcnt
Text element counter for assigning IDs to the text elements.
-
pagecnt
protected int pagecnt
Page counter for assigning IDs to the pages.
-
config
protected PDFDomTreeConfig config
-
-
Constructor Detail
-
PDFDomTree
public PDFDomTree() throws IOException
Creates a new PDF DOM parser.
- Throws:
IOException
-
PDFDomTree
public PDFDomTree(PDFDomTreeConfig config) throws IOException
Creates a new PDF DOM parser.
- Throws:
IOException
-
-
Method Detail
-
createDocument
protected void createDocument() throws ParserConfigurationException
Creates a new empty HTML document tree.
- Throws:
ParserConfigurationException
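As an illustration of what "a new empty HTML document tree" can look like, the following is a minimal sketch built only with the JDK's DOM API. It is not the library's implementation: the class and method names (EmptyHtmlTreeSketch, createEmptyHtmlDocument) are hypothetical, and the exact set of elements is an assumption based on the head, title, globalStyle and body fields listed above.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of pdf2dom: builds an empty HTML skeleton.
class EmptyHtmlTreeSketch {

    static Document createEmptyHtmlDocument() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();

            // Root element and the head/body pair (assumed layout).
            Element html = doc.createElement("html");
            Element head = doc.createElement("head");
            Element title = doc.createElement("title");
            Element style = doc.createElement("style");
            style.setAttribute("type", "text/css");
            Element body = doc.createElement("body");

            head.appendChild(title);
            head.appendChild(style);
            html.appendChild(head);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            // Wrapped so callers of this sketch need not handle a checked exception.
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = createEmptyHtmlDocument();
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

In this sketch the page elements would later be appended to the body element, which mirrors the role of the body and curpage fields described in the field summary.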
-
getDocument
public Document getDocument()
Obtains the resulting document tree.
- Returns:
- The DOM root element.
-
startDocument
public void startDocument(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
- Overrides:
startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
endDocument
protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
- Overrides:
endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
writeText
public void writeText(org.apache.pdfbox.pdmodel.PDDocument doc, Writer outputStream) throws IOException
Parses a PDF document and serializes the resulting DOM tree to an output. This requires a DOM Level 3 capable implementation to be available.
- Overrides:
writeText in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
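The DOM Level 3 requirement refers to the Load and Save (LS) module. The following sketch shows how a DOM tree can be serialized to a Writer with the JDK's LS API. It is an assumption about the general technique, not a copy of what writeText does internally; the class and method names (DomSerializeSketch, serialize, demo) are hypothetical.

```java
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper, not part of pdf2dom: DOM Level 3 LS serialization.
class DomSerializeSketch {

    static void serialize(Document doc, Writer out) {
        // Ask the DOM implementation for its Load and Save feature.
        Object feature = doc.getImplementation().getFeature("LS", "3.0");
        if (!(feature instanceof DOMImplementationLS)) {
            throw new IllegalStateException("No DOM Level 3 LS implementation available");
        }
        DOMImplementationLS ls = (DOMImplementationLS) feature;
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput output = ls.createLSOutput();
        output.setCharacterStream(out);
        serializer.write(doc, output);
    }

    // Builds a tiny document and returns its serialized form.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.setTextContent("hello");
            html.appendChild(body);
            doc.appendChild(html);
            StringWriter out = new StringWriter();
            serialize(doc, out);
            return out.toString();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If no LS-capable implementation is present, this sketch fails with an exception instead of silently producing no output.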
-
createDOM
public Document createDOM(org.apache.pdfbox.pdmodel.PDDocument doc) throws IOException
Loads a PDF document and creates a DOM tree from it.
- Parameters:
doc - the source document
- Returns:
- a DOM Document representing the DOM tree
- Throws:
IOException
-
startNewPage
protected void startNewPage()
Description copied from class: PDFBoxTree
Adds a new page to the resulting document and makes it the current (active) page.
- Specified by:
startNewPage in class PDFBoxTree
-
renderText
protected void renderText(String data, TextMetrics metrics)
Description copied from class: PDFBoxTree
Creates a new text box in the current page. The style and position of the text are contained in the PDFBoxTree.curstyle property.
- Specified by:
renderText in class PDFBoxTree
- Parameters:
data - The text contents.
-
renderPath
protected void renderPath(List<PathSegment> path, boolean stroke, boolean fill) throws IOException
Description copied from class: PDFBoxTree
Adds a rectangle to the current page on the specified position.
- Specified by:
renderPath in class PDFBoxTree
- Parameters:
stroke - should there be a stroke around?
fill - should the rectangle be filled?
- Throws:
IOException
-
renderImage
protected void renderImage(float x, float y, float width, float height, ImageResource resource) throws IOException
Description copied from class: PDFBoxTree
Adds an image to the current page.
- Specified by:
renderImage in class PDFBoxTree
- Parameters:
x - the X coordinate of the image
y - the Y coordinate of the image
width - the width of the image
height - the height of the image
resource - the image data depending on the specified type
- Throws:
IOException
-
createPageElement
protected Element createPageElement()
Creates an element that represents a single page.
- Returns:
- the resulting DOM element
-
createTextElement
protected Element createTextElement(float width)
Creates an element that represents a single positioned box with no content.
- Returns:
- the resulting DOM element
-
createTextElement
protected Element createTextElement(String data, float width)
Creates an element that represents a single positioned box containing the specified text string.
- Parameters:
data - the text string to be contained in the created box.
- Returns:
- the resulting DOM element
-
createRectangleElement
protected Element createRectangleElement(float x, float y, float width, float height, boolean stroke, boolean fill)
Creates an element that represents a rectangle drawn at the specified coordinates in the page.
- Parameters:
x - the X coordinate of the rectangle
y - the Y coordinate of the rectangle
width - the width of the rectangle
height - the height of the rectangle
stroke - should there be a stroke around?
fill - should the rectangle be filled?
- Returns:
- the resulting DOM element
-
createLineElement
protected Element createLineElement(float x1, float y1, float x2, float y2)
Creates an element that represents a horizontal or vertical line.
- Parameters:
x1 -
y1 -
x2 -
y2 -
- Returns:
- the created DOM element
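Because HTML has no native line primitive, an axis-aligned line can be expressed as a zero-height or zero-width box with a single border. The following sketch illustrates that idea only; it is not the code of PDFDomTree.HtmlDivLine or createLineElement. The class name, method name, strokeWidth parameter and the exact CSS properties are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of pdf2dom: maps an axis-aligned line to div CSS.
class LineToDivSketch {

    static String lineToDivStyle(float x1, float y1, float x2, float y2, float strokeWidth) {
        float left = Math.min(x1, x2);
        float top = Math.min(y1, y2);
        float width = Math.abs(x2 - x1);
        float height = Math.abs(y2 - y1);
        if (width != 0 && height != 0) {
            throw new IllegalArgumentException("This sketch handles only horizontal or vertical lines");
        }
        // Horizontal line: zero-height box drawn by its top border.
        // Vertical line: zero-width box drawn by its left border.
        String border = (height == 0) ? "border-top" : "border-left";
        return String.format(Locale.ROOT,
                "position:absolute;left:%.1fpt;top:%.1fpt;width:%.1fpt;height:%.1fpt;%s:%.1fpt solid",
                left, top, width, height, border, strokeWidth);
    }

    public static void main(String[] args) {
        // prints position:absolute;left:10.0pt;top:20.0pt;width:100.0pt;height:0.0pt;border-top:1.0pt solid
        System.out.println(lineToDivStyle(10f, 20f, 110f, 20f, 1f));
    }
}
```

The end points are normalized with min and abs so that the result does not depend on the direction in which the line was drawn.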
-
createPathImage
protected Element createPathImage(List<PathSegment> path) throws IOException
- Throws:
IOException
-
createImageElement
protected Element createImageElement(float x, float y, float width, float height, ImageResource resource) throws IOException
Creates an element that represents an image drawn at the specified coordinates in the page.
- Parameters:
x - the X coordinate of the image
y - the Y coordinate of the image
width - the width of the image
height - the height of the image
resource - the image data depending on the specified type
- Returns:
- the resulting DOM element
- Throws:
IOException
-
createGlobalStyle
protected String createGlobalStyle()
Generates the global CSS style for the whole document.
- Returns:
- the CSS code used in the generated document header
-
updateFontTable
protected void updateFontTable()
Description copied from class: PDFBoxTree
Updates the font table by adding new fonts used at the current page.
- Overrides:
updateFontTable in class PDFBoxTree
-
createFontFaces
protected String createFontFaces()
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
- Overrides:
showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
IOException
-
-