Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
An alpha-quality text extractor, added in Tika 1.24, that builds the text from the PDF's marked content tree and includes or normalizes some of the structural tags.
- Since:
- 1.24
-
Field Summary
Fields -
Method Summary
Modifier and Type | Method | Description
int | getCurrentPageNo() | Overridden because this class also overrides processPages(PDPageTree).
int | getStartPage() |
static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) | Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
void | processPage(org.apache.pdfbox.pdmodel.PDPage page) |
void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) |
void | setStartPage(int startPage) |
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
-
Field Details
-
XMP_DOCUMENT_CATALOG_LOCATION
-
XMP_PAGE_LOCATION_PREFIX
-
-
Method Details
-
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
- Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
- Throws:
SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing
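As a rough illustration of the "stream of XHTML SAX events" that process sends to its handler, the sketch below shows one possible consumer: a handler that gathers the text of each paragraph. The class name ParagraphCollector is hypothetical, and the assumption that paragraphs arrive as elements with the local name "p" is ours, not something this page states. To stay runnable on the JDK alone, main feeds the events by hand; the real call appears only in a comment.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each <p> element from SAX events. */
public class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: paragraphs are reported with the local name "p".
        if ("p".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside a <p> is ignored in this sketch.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName) && current != null) {
            paragraphs.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getParagraphs() {
        return paragraphs;
    }

    public static void main(String[] args) {
        ParagraphCollector handler = new ParagraphCollector();
        // With Tika and PDFBox on the classpath, the events would instead come from:
        //   PDFMarkedContent2XHTML.process(pdDocument, handler, context, metadata, config);
        // Here the events are fed by hand so the sketch needs nothing beyond the JDK.
        String xhtml = "http://www.w3.org/1999/xhtml";
        Attributes none = new AttributesImpl();
        for (String text : new String[] {"First paragraph", "Second paragraph"}) {
            handler.startElement(xhtml, "p", "p", none);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement(xhtml, "p", "p");
        }
        System.out.println(handler.getParagraphs()); // prints [First paragraph, Second paragraph]
    }
}
```

Any org.xml.sax.ContentHandler can be passed as the handler argument, so the same pattern would apply to a serializer or an indexing handler.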
-
processPage
- Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
getCurrentPageNo
public int getCurrentPageNo()
Overridden because this class also overrides processPages(PDPageTree).
- Returns:
-
setStartPage
public void setStartPage(int startPage)
- Overrides:
setStartPage in class org.apache.pdfbox.text.PDFTextStripper
-
getStartPage
public int getStartPage()
- Overrides:
getStartPage in class org.apache.pdfbox.text.PDFTextStripper
-