java.lang.Object
com.lowagie.text.pdf.parser.PdfTextExtractor
Extracts text from a PDF file.
Since: 2.1.4
-
Constructor Summary
Constructor and Description
PdfTextExtractor(PdfReader reader)
    Creates a new Text Extractor object, using a TextAssembler as the render listener.
PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)
    Creates a new Text Extractor object, using a TextAssembler as the render listener.
PdfTextExtractor(PdfReader reader, TextAssembler renderListener)
    Creates a new Text Extractor object.
Method Summary
Modifier and Type, Method, and Description
String getTextFromPage(int page)
    Gets the text from a page.
String getTextFromPage(int page, boolean useContainerMarkup)
    Gets the text from the page, optionally with container markup.
void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)
    Processes PDF syntax.
-
Constructor Details
-
PdfTextExtractor
public PdfTextExtractor(PdfReader reader)
Creates a new Text Extractor object, using a TextAssembler as the render listener.
Parameters:
reader - the reader with the PDF
-
PdfTextExtractor
public PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)
Creates a new Text Extractor object, using a TextAssembler as the render listener.
Parameters:
reader - the reader with the PDF
usePdfMarkupElements - whether to use higher-level tags for PDF markup entities
-
PdfTextExtractor
public PdfTextExtractor(PdfReader reader, TextAssembler renderListener)
Creates a new Text Extractor object.
Parameters:
reader - the reader with the PDF
renderListener - the render listener that will be used to analyze renderText operations and provide the resultant text
-
-
Method Details
-
getTextFromPage
public String getTextFromPage(int page) throws IOException
Gets the text from a page.
Parameters:
page - the 1-based page number
Returns:
a String with the content as plain text (without PDF syntax)
Throws:
IOException - on error
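Because page numbers are 1-based, a loop over a whole document runs from 1 to the page count inclusive. The following is only a sketch of that loop. `PageTextSource`, `AllPagesText`, and `extractAll` are hypothetical names introduced here so that the example compiles without the library; `PageTextSource` stands in for `getTextFromPage(int)`. Wrapping the `IOException` in an `UncheckedIOException` is a design choice of the sketch, not something the class itself does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.StringJoiner;

final class AllPagesText {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(int); not part of the library. */
    @FunctionalInterface
    public interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /** Joins the text of pages 1..pageCount, separated by newlines. */
    public static String extractAll(PageTextSource source, int pageCount) {
        StringJoiner joined = new StringJoiner("\n");
        // Page numbers are 1-based, so the loop starts at 1 and includes pageCount.
        for (int page = 1; page <= pageCount; page++) {
            try {
                joined.add(source.getTextFromPage(page));
            } catch (IOException e) {
                // Assumption of this sketch: fail fast and surface the error unchecked.
                throw new UncheckedIOException(e);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Fake source, used only so the sketch runs without a PDF file.
        PageTextSource fake = page -> "text of page " + page;
        System.out.println(extractAll(fake, 3));
    }
}
```

With the library on the classpath, the stand-in could presumably be supplied as a method reference such as `extractor::getTextFromPage`, where `extractor` is a `PdfTextExtractor`, and the page count taken from `PdfReader.getNumberOfPages()`.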
-
getTextFromPage
public String getTextFromPage(int page, boolean useContainerMarkup) throws IOException
Gets the text from the page.
Parameters:
page - the page number we are interested in
useContainerMarkup - whether to insert tags for PDF markup container elements (not really HTML at the moment)
Returns:
the result of extracting the text, with tags as requested
Throws:
IOException - on error
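Since each call can throw an IOException, a caller may prefer to record a failing page and continue rather than abort the whole document. The following sketch shows one way to do that under stated assumptions. `MarkupPageTextSource`, `TolerantExtraction`, and `Result` are hypothetical names that are not part of the library; `MarkupPageTextSource` stands in for `getTextFromPage(int, boolean)` so that the example compiles on its own.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TolerantExtraction {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(int, boolean). */
    @FunctionalInterface
    public interface MarkupPageTextSource {
        String getTextFromPage(int page, boolean useContainerMarkup) throws IOException;
    }

    /** Text of the pages that were read, keyed by page number, plus the pages that failed. */
    public static final class Result {
        public final Map<Integer, String> textByPage = new LinkedHashMap<>();
        public final List<Integer> failedPages = new ArrayList<>();
    }

    public static Result extract(MarkupPageTextSource source, int pageCount,
                                 boolean useContainerMarkup) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.textByPage.put(page, source.getTextFromPage(page, useContainerMarkup));
            } catch (IOException e) {
                // Record the failure and carry on with the remaining pages.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake source that fails on page 2, used only to exercise both paths.
        MarkupPageTextSource fake = (page, markup) -> {
            if (page == 2) {
                throw new IOException("simulated read failure");
            }
            return (markup ? "[tagged] " : "") + "page " + page;
        };
        Result result = extract(fake, 3, true);
        System.out.println(result.textByPage);
        System.out.println(result.failedPages);
    }
}
```

The `"[tagged] "` prefix in the fake source is a placeholder only; the actual tags produced when `useContainerMarkup` is true are not specified here.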
-
processContent
public void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)
Processes PDF syntax.
Parameters:
contentBytes - the bytes of a content stream
resources - the resources that come with the content stream
handler - interprets events caused by recognition of operations in a content stream
-