Package com.tom_roush.pdfbox.text
Class PDFMarkedContentExtractor
java.lang.Object
    com.tom_roush.pdfbox.contentstream.PDFStreamEngine
        com.tom_roush.pdfbox.text.PDFMarkedContentExtractor
public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
-
-
Constructor Summary
Constructor    Description
PDFMarkedContentExtractor()
    Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(String encoding)
    Constructor.
-
Method Summary
Modifier and Type    Method    Description
void    beginMarkedContentSequence(COSName tag, COSDictionary properties)
void    endMarkedContentSequence()
List<PDMarkedContent>    getMarkedContents()
void    processPage(PDPage page)
    This will initialise and process the contents of the stream.
protected void    processTextPosition(TextPosition text)
    This will process a TextPosition object and add the text to the list of characters on a page.
protected void    showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
    This method was originally written by Ben Litchfield for PDFStreamEngine.
protected void    showText(byte[] string)
    Process text from the PDF Stream.
void    xobject(PDXObject xobject)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from class com.tom_roush.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getResources, getTextLineMatrix, getTextMatrix, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
-
-
-
Constructor Detail
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor() throws IOException
Instantiate a new PDFMarkedContentExtractor object.
Throws:
    IOException
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(String encoding) throws IOException
Constructor. Will apply encoding-specific conversions to the output text.
Parameters:
    encoding - The encoding that the output will be written in.
Throws:
    IOException
-
-
Method Detail
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
-
endMarkedContentSequence
public void endMarkedContentSequence()
-
xobject
public void xobject(PDXObject xobject)
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
Parameters:
    text - The text to process.
-
getMarkedContents
public List<PDMarkedContent> getMarkedContents()
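The begin and end callbacks above pair up like brackets, and getMarkedContents() returns whatever they collected. The page does not describe how nesting is handled, so the following is only a simplified, standalone model of one plausible approach: a stack of open sequences plus a list of top-level ones. MarkedContentModel, Node, begin, end, text and roots are hypothetical names invented for this sketch; they are not part of com.tom_roush.pdfbox and should not be read as the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of nested marked-content collection (not library code).
class MarkedContentModel {

    // Stand-in for a marked-content node: a tag plus its children.
    static class Node {
        final String tag;
        // Holds text runs (String) and nested sequences (Node), in order.
        final List<Object> contents = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> roots = new ArrayList<>();  // top-level sequences
    private final Deque<Node> open = new ArrayDeque<>(); // currently open sequences

    // Models beginMarkedContentSequence: open a new sequence under the current one.
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().contents.add(node);
        }
        open.push(node);
    }

    // Models endMarkedContentSequence: close the innermost open sequence, if any.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    // Models text arriving while a sequence is open; text outside any sequence is dropped.
    void text(String s) {
        if (!open.isEmpty()) {
            open.peek().contents.add(s);
        }
    }

    // Models getMarkedContents: only the top-level sequences are returned.
    List<Node> roots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentModel model = new MarkedContentModel();
        model.begin("Document");
        model.begin("P");
        model.text("Hello");
        model.end();
        model.end();
        for (Node root : model.roots()) {
            System.out.println(root.tag + " has " + root.contents.size() + " child item(s)");
        }
    }
}
```

In this model, a caller that wants every nested sequence has to walk each root's children recursively, since only the outermost sequences appear in the returned list.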
-
processPage
public void processPage(PDPage page) throws IOException
This will initialise and process the contents of the stream.
Overrides:
    processPage in class PDFStreamEngine
Parameters:
    page - the page to process
Throws:
    IOException - if there is an error accessing the stream.
-
showText
protected void showText(byte[] string) throws IOException
Description copied from class: PDFStreamEngine
Process text from the PDF Stream. You should override this method if you want to perform an action when encoded text is being processed.
Overrides:
    showText in class PDFStreamEngine
Parameters:
    string - the encoded text
Throws:
    IOException - if there is an error processing the string
-
showGlyph
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
This method was originally written by Ben Litchfield for PDFStreamEngine.
Overrides:
    showGlyph in class PDFStreamEngine
Parameters:
    textRenderingMatrix - the current text rendering matrix, Trm
    font - the current font
    code - internal PDF character code for the glyph
    unicode - the Unicode text for this glyph, or null if the PDF does not provide it
    displacement - the displacement (i.e. advance) of the glyph in text space
Throws:
    IOException - if the glyph cannot be processed
-
-