Class PDFMarkedContent2XHTML

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.tika.parser.pdf.PDFMarkedContent2XHTML

public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper

Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the PDF's marked content tree and includes or normalizes some of the structural tags.

Since:
1.24
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final String
     
    static final String
     
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    getCurrentPageNo()
    Overridden because this class also overrides processPages(PDPageTree).
    int
    getStartPage()
     
    static void
    process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
    Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
    void
    processPage(org.apache.pdfbox.pdmodel.PDPage page)
     
    void
    setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
     
    void
    setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
     
    void
    setStartPage(int startPage)
     

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Method Details

    • process

      public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
      Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
      Parameters:
      pdDocument - PDF document
      handler - SAX content handler
      context - parse context
      metadata - PDF metadata
      config - PDF parser configuration
      Throws:
      SAXException - if the content handler fails to process SAX events
      TikaException - if there was an exception outside of per page processing
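
      To make "a stream of XHTML SAX events sent to the given content handler" concrete, here is a minimal sketch of the receiving side, using only the JDK's SAX classes. The class EventRecordingHandler and its demo() method are hypothetical names introduced for illustration, and the events fired in demo() are hand-written stand-ins, not actual output of this class.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records element names and collects character data. */
final class EventRecordingHandler extends DefaultHandler {
    private final List<String> elements = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Remember each element that is opened, e.g. "p" or "div".
        elements.add(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Nothing to do when an element closes in this sketch.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Accumulate the text content delivered between element events.
        text.append(ch, start, length);
    }

    List<String> elements() {
        return elements;
    }

    String text() {
        return text.toString();
    }

    /** Fires a few hand-written events as a stand-in for what an extractor might emit. */
    static EventRecordingHandler demo() {
        EventRecordingHandler handler = new EventRecordingHandler();
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] chars = "Hello".toCharArray();
        handler.startElement(xhtml, "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length);
        handler.endElement(xhtml, "p", "p");
        return handler;
    }

    public static void main(String[] args) {
        EventRecordingHandler handler = demo();
        System.out.println(handler.elements() + " " + handler.text());
    }
}
```

      A handler of this kind could be supplied as the handler argument of process. Which elements actually arrive depends on the marked content of the document being parsed.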
    • processPage

      public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
      Overrides:
      processPage in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • getCurrentPageNo

      public int getCurrentPageNo()
      Overridden because this class also overrides processPages(PDPageTree).
      Returns:
      the current page number
    • setStartBookmark

      public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      Overrides:
      setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
    • setEndBookmark

      public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      Overrides:
      setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
    • setStartPage

      public void setStartPage(int startPage)
      Overrides:
      setStartPage in class org.apache.pdfbox.text.PDFTextStripper
    • getStartPage

      public int getStartPage()
      Overrides:
      getStartPage in class org.apache.pdfbox.text.PDFTextStripper