Class PDFMarkedContentExtractor


public class PDFMarkedContentExtractor extends PDFTextStreamEngine
This is a stream engine to extract the marked content of a PDF.
Author:
Johannes Koch
  • Constructor Details

    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor() throws IOException
      Instantiate a new PDFMarkedContentExtractor object.
      Throws:
      IOException
    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor(String encoding) throws IOException
      Constructor. Will apply encoding-specific conversions to the output text.
      Parameters:
      encoding - The encoding that the output will be written in.
      Throws:
      IOException
  • Method Details

    • isSuppressDuplicateOverlappingText

      public boolean isSuppressDuplicateOverlappingText()
      Returns:
      the suppressDuplicateOverlappingText setting.
    • setSuppressDuplicateOverlappingText

      public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
      By default, the class attempts to remove duplicate text that overlaps other text. For example, Word paints the same character several times to make it look bold. Setting this to false extracts all text, so some sections will be duplicated, but extraction will perform better.
      Parameters:
      suppressDuplicateOverlappingText - The suppressDuplicateOverlappingText setting to set.
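
The idea behind duplicate suppression can be sketched without PDFBox at all. The following is a minimal illustration, not the library's actual implementation: `Glyph`, `OverlapFilterSketch`, and the `tolerance` parameter are hypothetical names introduced here, and the comparison rule (same character within a fixed distance) is an assumption chosen for clarity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned character; not a PDFBox type.
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Illustrative filter: drops a glyph when the same character was already
// kept within a small tolerance of the same position.
final class OverlapFilterSketch {
    private final float tolerance;
    private final List<Glyph> kept = new ArrayList<>();

    OverlapFilterSketch(float tolerance) {
        this.tolerance = tolerance;
    }

    // Returns true if the glyph was kept, false if it was treated as a duplicate.
    boolean offer(Glyph g) {
        for (Glyph seen : kept) {
            boolean sameText = seen.text.equals(g.text);
            boolean samePlace = Math.abs(seen.x - g.x) < tolerance
                    && Math.abs(seen.y - g.y) < tolerance;
            if (sameText && samePlace) {
                return false;
            }
        }
        kept.add(g);
        return true;
    }

    // Concatenates the kept glyphs in arrival order.
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : kept) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OverlapFilterSketch filter = new OverlapFilterSketch(0.5f);
        // "B" painted twice with a tiny offset to simulate faux bold.
        filter.offer(new Glyph("B", 10.0f, 20.0f));
        filter.offer(new Glyph("B", 10.2f, 20.0f));
        filter.offer(new Glyph("o", 16.0f, 20.0f));
        System.out.println(filter.text());
    }
}
```

In this sketch every new glyph is compared against all glyphs kept so far, which hints at why turning suppression off trades duplicated output for speed.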
    • beginMarkedContentSequence

      public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
      Description copied from class: PDFStreamEngine
      Called when a marked content group begins
      Overrides:
      beginMarkedContentSequence in class PDFStreamEngine
      Parameters:
      tag - content tag
      properties - optional properties
    • endMarkedContentSequence

      public void endMarkedContentSequence()
      Description copied from class: PDFStreamEngine
      Called when a marked content group ends
      Overrides:
      endMarkedContentSequence in class PDFStreamEngine
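
Marked content sequences in a PDF content stream can nest, so the begin and end callbacks pair up like brackets. One way to picture how such callbacks could build a tree is a stack of open groups. The sketch below is an assumption about the general technique, not a description of this class's internals; `MarkedNode` and `MarkedContentTrackerSketch` are hypothetical types that use plain strings in place of `COSName` tags.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node type for illustration; not PDFBox's PDMarkedContent.
final class MarkedNode {
    final String tag;
    final List<MarkedNode> children = new ArrayList<>();

    MarkedNode(String tag) {
        this.tag = tag;
    }
}

final class MarkedContentTrackerSketch {
    // Groups that have begun but not yet ended, innermost on top.
    private final Deque<MarkedNode> open = new ArrayDeque<>();
    // Groups that were opened while no other group was open.
    private final List<MarkedNode> topLevel = new ArrayList<>();

    // Analogous to a "begin" callback: attach to the enclosing group, then push.
    void begin(String tag) {
        MarkedNode node = new MarkedNode(tag);
        if (open.isEmpty()) {
            topLevel.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    // Analogous to an "end" callback: close the innermost open group, if any.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<MarkedNode> topLevel() {
        return topLevel;
    }

    int depth() {
        return open.size();
    }

    public static void main(String[] args) {
        MarkedContentTrackerSketch tracker = new MarkedContentTrackerSketch();
        tracker.begin("P");
        tracker.begin("Span"); // nested inside P
        tracker.end();
        tracker.end();
        tracker.begin("H1");
        tracker.end();
        System.out.println(tracker.topLevel().size());
    }
}
```

The `end` method ignores an unmatched end rather than throwing, a design choice made here so that a malformed stream would not abort the walk.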
    • xobject

      public void xobject(PDXObject xobject)
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Overrides:
      processTextPosition in class PDFTextStreamEngine
      Parameters:
      text - The text to process.
    • getMarkedContents

      public List<PDMarkedContent> getMarkedContents()