Package com.tom_roush.pdfbox.text
Class PDFTextStripperByArea
- java.lang.Object
  - com.tom_roush.pdfbox.contentstream.PDFStreamEngine
    - com.tom_roush.pdfbox.text.PDFTextStripper
      - com.tom_roush.pdfbox.text.PDFTextStripperByArea
public class PDFTextStripperByArea extends PDFTextStripper
Extracts text from named rectangular regions of a PDF page.
-
Field Summary
-
Fields inherited from class com.tom_roush.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Constructor Summary
Constructor | Description
PDFTextStripperByArea() | Constructor.
-
Method Summary
Modifier and Type | Method | Description
void | addRegion(String regionName, RectF rect) | Add a new region to group text by.
protected float | computeFontHeight(PDFont font) | Compute the font height.
void | extractRegions(PDPage page) | Process the page to extract the region text.
List<String> | getRegions() | Get the list of regions that have been set up.
String | getTextForRegion(String regionName) | Get the text for the region. This should be called after extractRegions().
protected void | processTextPosition(TextPosition text) | This will process a TextPosition object and add the text to the list of characters on a page.
void | removeRegion(String regionName) | Delete a region to group text by.
void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads) | This method does nothing in this derived class, because beads and regions are incompatible.
protected void | showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) | Called when a glyph is to be processed.
protected void | writePage() | This will print the processed page text to the output stream.
Methods inherited from class com.tom_roush.pdfbox.text.PDFTextStripper
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
-
Methods inherited from class com.tom_roush.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Constructor Detail
-
PDFTextStripperByArea
public PDFTextStripperByArea() throws IOException
Constructor.
Throws:
IOException - If there is an error loading properties.
-
Method Detail
-
setShouldSeparateByBeads
public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
Overrides:
setShouldSeparateByBeads in class PDFTextStripper
Parameters:
aShouldSeparateByBeads - The new grouping of beads.
-
addRegion
public void addRegion(String regionName, RectF rect)
Add a new region to group text by.
Parameters:
regionName - The name of the region.
rect - The rectangle area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
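Because addRegion expects a top-left origin, a rectangle measured in PDF user space has to be flipped vertically first. The following is a minimal sketch of that arithmetic, assuming an unrotated page whose origin sits at its lower-left corner; `RegionCoordinates` and `toTopLeftBounds` are hypothetical names that are not part of the library, and the page height is supplied by the caller.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of PdfBox-Android. */
final class RegionCoordinates {
    private RegionCoordinates() {}

    /**
     * Converts a box given in PDF user space (origin bottom-left, y grows upward)
     * into {left, top, right, bottom} with the origin at the top-left and y growing downward.
     * Assumes an unrotated page whose lower-left corner is at (0, 0).
     */
    static float[] toTopLeftBounds(float pageHeight, float pdfX, float pdfY,
                                   float width, float height) {
        float left = pdfX;
        float top = pageHeight - (pdfY + height); // upper edge, measured down from the page top
        float right = pdfX + width;
        float bottom = pageHeight - pdfY;         // lower edge, measured down from the page top
        return new float[] {left, top, right, bottom};
    }

    public static void main(String[] args) {
        // 792 pt tall page; box of 200 x 100 pt whose lower-left corner is at (72, 72).
        float[] bounds = toTopLeftBounds(792f, 72f, 72f, 200f, 100f);
        System.out.println(Arrays.toString(bounds)); // prints [72.0, 620.0, 272.0, 720.0]
    }
}
```

The four values are in the same order as the arguments of the `android.graphics.RectF(left, top, right, bottom)` constructor, so they can be passed straight into the rectangle handed to addRegion.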
-
removeRegion
public void removeRegion(String regionName)
Delete a region to group text by. If the region does not exist, this method does nothing.
Parameters:
regionName - The name of the region to delete.
-
getRegions
public List<String> getRegions()
Get the list of regions that have been set up.
Returns:
A list of java.lang.String objects to identify the region names.
-
getTextForRegion
public String getTextForRegion(String regionName)
Get the text for the region. This should be called after extractRegions().
Parameters:
regionName - The name of the region to get the text from.
Returns:
The text that was identified in that region.
-
extractRegions
public void extractRegions(PDPage page) throws IOException
Process the page to extract the region text.
Parameters:
page - The page to extract the regions from.
Throws:
IOException - If there is an error while extracting text.
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
Overrides:
processTextPosition in class PDFTextStripper
Parameters:
text - The text to process.
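The idea of grouping text by region can be pictured with a small stand-alone model. This is only a sketch under stated assumptions, not the library's code: it assumes a plain point-in-rectangle test on each glyph's position, it skips the overlapping-text handling mentioned above, and `RegionGroupingSketch`, `Box`, `process` and `textFor` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified stand-in for region grouping; every name here is hypothetical. */
final class RegionGroupingSketch {

    /** Axis-aligned box with a top-left origin (y grows downward). */
    static final class Box {
        final float left, top, right, bottom;

        Box(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        /** Left and top edges are inclusive, right and bottom edges exclusive. */
        boolean contains(float x, float y) {
            return x >= left && x < right && y >= top && y < bottom;
        }
    }

    private final Map<String, Box> regions = new LinkedHashMap<>();
    private final Map<String, StringBuilder> textByRegion = new LinkedHashMap<>();

    void addRegion(String name, Box box) {
        regions.put(name, box);
        textByRegion.put(name, new StringBuilder());
    }

    /** Appends the glyph to every region whose box contains the point (x, y). */
    void process(String glyph, float x, float y) {
        for (Map.Entry<String, Box> entry : regions.entrySet()) {
            if (entry.getValue().contains(x, y)) {
                textByRegion.get(entry.getKey()).append(glyph);
            }
        }
    }

    /** Returns the collected text, or an empty string for an unknown region. */
    String textFor(String name) {
        StringBuilder sb = textByRegion.get(name);
        return sb == null ? "" : sb.toString();
    }

    public static void main(String[] args) {
        RegionGroupingSketch sketch = new RegionGroupingSketch();
        sketch.addRegion("header", new Box(0f, 0f, 612f, 100f));
        sketch.addRegion("body", new Box(0f, 100f, 612f, 792f));
        sketch.process("T", 72f, 50f);   // lands in "header"
        sketch.process("B", 72f, 300f);  // lands in "body"
        sketch.process("x", 700f, 50f);  // outside both boxes, so it is dropped
        System.out.println(sketch.textFor("header")); // prints T
        System.out.println(sketch.textFor("body"));   // prints B
    }
}
```

In this model a glyph that falls inside two overlapping boxes is appended to both, and a glyph outside every box is discarded.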
-
writePage
protected void writePage() throws IOException
This will print the processed page text to the output stream.
Overrides:
writePage in class PDFTextStripper
Throws:
IOException - If there is an error writing the text.
-
showGlyph
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
Called when a glyph is to be processed. The heuristic calculations here were originally written by Ben Litchfield for PDFStreamEngine.
Overrides:
showGlyph in class PDFStreamEngine
Parameters:
textRenderingMatrix - the current text rendering matrix, Trm
font - the current font
code - internal PDF character code for the glyph
unicode - the Unicode text for this glyph, or null if the PDF does not provide it
displacement - the displacement (i.e. advance) of the glyph in text space
Throws:
IOException - if the glyph cannot be processed
-
computeFontHeight
protected float computeFontHeight(PDFont font) throws IOException
Compute the font height. Override this if you want to use your own calculations.
Parameters:
font - the font.
Returns:
the font height.
Throws:
IOException - if there is an error while getting the font bounding box.