java.lang.Object
org.sejda.sambox.contentstream.PDFStreamEngine
org.sejda.sambox.text.PDFTextStreamEngine
org.sejda.sambox.text.PDFMarkedContentExtractor
This is a stream engine to extract the marked content of a PDF.
- Author:
- Johannes Koch
-
Constructor Summary
Constructor - Description
PDFMarkedContentExtractor() - Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(String encoding) - Constructor.
Method Summary
Modifier and Type - Method - Description
void - beginMarkedContentSequence(COSName tag, COSDictionary properties) - Called when a marked content group begins.
void - endMarkedContentSequence() - Called when a marked content group ends.
boolean - isSuppressDuplicateOverlappingText()
protected void - processTextPosition(TextPosition text) - This will process a TextPosition object and add the text to the list of characters on a page.
void - setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText) - By default the class will attempt to remove text that overlaps each other.
void - xobject
Methods inherited from class org.sejda.sambox.text.PDFTextStreamEngine
computeFontHeight, processPage, showGlyph
Methods inherited from class org.sejda.sambox.contentstream.PDFStreamEngine
addOperator, addOperatorIfAbsent, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processStream, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Constructor Details
-
PDFMarkedContentExtractor
Instantiate a new PDFMarkedContentExtractor object.
- Throws:
IOException
-
PDFMarkedContentExtractor
Constructor. Will apply encoding-specific conversions to the output text.
- Parameters:
encoding - The encoding that the output will be written in.
- Throws:
IOException
-
-
Method Details
-
isSuppressDuplicateOverlappingText
public boolean isSuppressDuplicateOverlappingText()
- Returns:
- the suppressDuplicateOverlappingText setting.
-
setSuppressDuplicateOverlappingText
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
By default the class attempts to remove text that overlaps other text. Word, for example, paints the same character several times to make it look bold. Setting this to false extracts all text, so some sections will be duplicated, but extraction will be faster.
- Parameters:
suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
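The effect of this setting can be illustrated with a minimal, library-free sketch. It is not the SAMBox implementation: the class `OverlapFilterSketch`, the `Glyph` type, and the tolerance of one third of the glyph width are all assumptions chosen for illustration. The idea is that a glyph is dropped when the same character has already been placed at nearly the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and tolerance are hypothetical, not SAMBox API. */
class OverlapFilterSketch {

    /** Hypothetical stand-in for a positioned glyph on a page. */
    static final class Glyph {
        final String unicode;
        final float x;
        final float y;
        final float width;

        Glyph(String unicode, float x, float y, float width) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Glyphs already accepted, grouped by character, so only same-character
    // glyphs are compared against each other.
    private final Map<String, List<Glyph>> seenByCharacter = new HashMap<>();
    private final List<Glyph> accepted = new ArrayList<>();
    private boolean suppressDuplicates = true;

    void setSuppressDuplicates(boolean suppressDuplicates) {
        this.suppressDuplicates = suppressDuplicates;
    }

    /** Returns true if the glyph was kept, false if it was dropped as an overlap. */
    boolean offer(Glyph glyph) {
        if (suppressDuplicates) {
            List<Glyph> sameChar =
                    seenByCharacter.computeIfAbsent(glyph.unicode, k -> new ArrayList<>());
            float tolerance = glyph.width / 3.0f; // assumed tolerance
            for (Glyph other : sameChar) {
                if (Math.abs(other.x - glyph.x) < tolerance
                        && Math.abs(other.y - glyph.y) < tolerance) {
                    return false; // same character at nearly the same spot
                }
            }
            sameChar.add(glyph);
        }
        accepted.add(glyph);
        return true;
    }

    /** Concatenates the kept glyphs in the order they were offered. */
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : accepted) {
            sb.append(g.unicode);
        }
        return sb.toString();
    }
}
```

With suppression disabled, the per-glyph lookup is skipped entirely, which is where the speed-up described above would come from in a design like this one.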
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
Description copied from class: PDFStreamEngine
Called when a marked content group begins.
- Overrides:
beginMarkedContentSequence in class PDFStreamEngine
- Parameters:
tag - content tag
properties - optional properties
-
endMarkedContentSequence
public void endMarkedContentSequence()
Description copied from class: PDFStreamEngine
Called when a marked content group ends.
- Overrides:
endMarkedContentSequence in class PDFStreamEngine
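Because marked content groups can nest, paired begin/end callbacks like the two above are commonly handled with a stack. The following is a minimal sketch of that pattern in plain Java, not the SAMBox implementation; `MarkedContentNestingSketch`, `Node`, and the `text` method are hypothetical names, and tags are plain strings rather than COSName objects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical, not SAMBox API. */
class MarkedContentNestingSketch {

    /** A marked content group holding text strings and nested groups. */
    static final class Node {
        final String tag;
        final List<Object> contents = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    // Groups that have begun but not yet ended; the innermost one is on top.
    private final Deque<Node> open = new ArrayDeque<>();
    private final List<Node> topLevel = new ArrayList<>();

    /** Called when a group begins: attach it to its parent, then make it current. */
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            topLevel.add(node);
        } else {
            open.peek().contents.add(node);
        }
        open.push(node);
    }

    /** Called when a group ends: the enclosing group becomes current again. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    /** Text is recorded only while some group is open. */
    void text(String s) {
        if (!open.isEmpty()) {
            open.peek().contents.add(s);
        }
    }

    List<Node> topLevel() {
        return topLevel;
    }
}
```

In this sketch, text that arrives outside any group is ignored, and an unmatched end call is tolerated rather than treated as an error; both are design choices made for the example.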
-
xobject
-
processTextPosition
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Overrides:
processTextPosition in class PDFTextStreamEngine
- Parameters:
text - The text to process.
-
getMarkedContents
-