Class PDFParserConfig

java.lang.Object
org.apache.tika.parser.pdf.PDFParserConfig
All Implemented Interfaces:
Serializable

public class PDFParserConfig extends Object implements Serializable
Config for PDFParser.

This allows parameters to be set programmatically:

  1. Calls on a PDFParser instance, e.g. parser.getPDFParserConfig().setEnableAutoSpace(true) (as before)
  2. Constructor of PDFParser
  3. Passing to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);

Parameters can also be set by editing the PDFParserConfig.properties file. In trunk it lives at: tika-parsers/src/main/resources/org/apache/tika/parser/pdf

In tika-app-x.x.jar or tika-parsers-x.x.jar it lives at: org/apache/tika/parser/pdf

  • Constructor Details

    • PDFParserConfig

      public PDFParserConfig()
    • PDFParserConfig

      public PDFParserConfig(InputStream is)
      Loads properties from the InputStream and then tries to close it. If an IOException occurs, the exception is silently swallowed and the defaults are used.
      Parameters:
      is -
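The "swallow the IOException and keep the defaults" behavior described above can be sketched with java.util.Properties alone. This is a minimal illustration, not Tika's implementation: the class LenientConfigLoader, the method loadSortByPosition, and the property key sortByPosition are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical stand-in for the documented constructor behavior; not Tika code. */
final class LenientConfigLoader {

    /** Default used when the stream cannot be read (value assumed for the demo). */
    static final boolean DEFAULT_SORT_BY_POSITION = false;

    /** Reads one boolean setting; on IOException, quietly keeps the default. */
    static boolean loadSortByPosition(InputStream is) {
        Properties props = new Properties();
        try (InputStream in = is) {   // try-with-resources closes the stream
            props.load(in);
        } catch (IOException e) {
            // Swallowed on purpose, mirroring the documented behavior.
            return DEFAULT_SORT_BY_POSITION;
        }
        String raw = props.getProperty("sortByPosition"); // assumed key name
        return raw == null ? DEFAULT_SORT_BY_POSITION : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        InputStream good = new ByteArrayInputStream(
                "sortByPosition true\n".getBytes(StandardCharsets.ISO_8859_1));
        InputStream broken = new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("simulated read failure");
            }
        };
        System.out.println(loadSortByPosition(good));   // value from the stream
        System.out.println(loadSortByPosition(broken)); // default, no exception surfaces
    }
}
```

One consequence of this design is that a caller cannot tell a broken stream from a stream that simply repeats the defaults, so it may be worth validating the stream before handing it over.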
  • Method Details

    • setExtractMarkedContent

      public void setExtractMarkedContent(boolean extractMarkedContent)
      If the PDF contains marked content, try to extract text and its marked structure. If the PDF does not contain marked content, fall back to the regular PDF2XHTML for text extraction. As of 1.24, this is an "alpha" version.
      Parameters:
      extractMarkedContent -
      Since:
      1.24
    • getExtractMarkedContent

      public boolean getExtractMarkedContent()
    • configure

      public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
      Configures the given pdf2XHTML.
      Parameters:
      pdf2XHTML -
    • getExtractAcroFormContent

      public boolean getExtractAcroFormContent()
    • setExtractAcroFormContent

      public void setExtractAcroFormContent(boolean extractAcroFormContent)
      If true (the default), extract content from AcroForms at the end of the document. If an XFA is found, try to process that, otherwise, process the AcroForm.
      Parameters:
      extractAcroFormContent -
    • getIfXFAExtractOnlyXFA

      public boolean getIfXFAExtractOnlyXFA()
      Returns:
      how to handle XFA data if it exists
    • setIfXFAExtractOnlyXFA

      public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
      If false (the default), extract content from the full PDF as well as the XFA form. This will likely lead to some duplicative content.
      Parameters:
      ifXFAExtractOnlyXFA -
    • getExtractBookmarksText

      public boolean getExtractBookmarksText()
    • setExtractBookmarksText

      public void setExtractBookmarksText(boolean extractBookmarksText)
      If true, extract bookmarks (document outline) text.

      The default is true.

      Parameters:
      extractBookmarksText -
    • setExtractFontNames

      public void setExtractFontNames(boolean extractFontNames)
      If true, extract font names into a metadata field.
      Parameters:
      extractFontNames -
    • getExtractFontNames

      public boolean getExtractFontNames()
    • getExtractInlineImages

      public boolean getExtractInlineImages()
    • setExtractInlineImages

      public void setExtractInlineImages(boolean extractInlineImages)
      If true, extract inline embedded OBXImages. Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution.

      The default is false.

      Parameters:
      extractInlineImages -
    • getExtractUniqueInlineImagesOnly

      public boolean getExtractUniqueInlineImagesOnly()
    • setExtractUniqueInlineImagesOnly

      public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
      Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.

      Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.

      For this parameter to have any effect, extractInlineImages must be set to true.

      Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.

      Parameters:
      extractUniqueInlineImagesOnly -
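The difference between uniqueness by object id and uniqueness by content can be sketched as follows. This is a simplified illustration under stated assumptions: ImageRef and select are hypothetical names, the per-page rule from TIKA-1742 is not modeled, and the real parser works on PDFBox COS objects rather than these stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of id-based de-duplication; not Tika code. */
final class UniqueImageSketch {

    /** Stand-in for an image reference: a PDF object id plus its content. */
    static final class ImageRef {
        final long objectId;
        final String content;

        ImageRef(long objectId, String content) {
            this.objectId = objectId;
            this.content = content;
        }
    }

    /** Returns the references that would be handed to an extractor. */
    static List<ImageRef> select(List<ImageRef> refsInDocumentOrder, boolean uniqueOnly) {
        Set<Long> seenIds = new HashSet<>();
        List<ImageRef> selected = new ArrayList<>();
        for (ImageRef ref : refsInDocumentOrder) {
            // Set.add returns false when the id was already recorded.
            boolean firstTime = seenIds.add(ref.objectId);
            if (!uniqueOnly || firstTime) {
                selected.add(ref);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<ImageRef> refs = Arrays.asList(
                new ImageRef(7, "logo"),   // page 1
                new ImageRef(7, "logo"),   // page 2, same object reused
                new ImageRef(9, "logo"));  // page 3, identical bytes, different id
        System.out.println(select(refs, true).size());  // 2: ids 7 and 9
        System.out.println(select(refs, false).size()); // 3: every reference
    }
}
```

The third reference survives de-duplication even though its content matches, which is the case the note above warns about.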
    • getEnableAutoSpace

      public boolean getEnableAutoSpace()
    • setEnableAutoSpace

      public void setEnableAutoSpace(boolean enableAutoSpace)
      If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
    • getSuppressDuplicateOverlappingText

      public boolean getSuppressDuplicateOverlappingText()
    • setSuppressDuplicateOverlappingText

      public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
      If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
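To make the trade-off concrete, here is a deliberately naive sketch of duplicate suppression. It is not PDFBox's algorithm: the Glyph type, the suppress method, and the tolerance value are assumptions for illustration only. It shows how fake bold collapses to a single copy, and one way a genuine character could be dropped when two identical glyphs sit close together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; PDFBox's real duplicate suppression is more involved. */
final class OverlapSuppressionSketch {

    /** A drawn character and its position (illustrative units). */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps a glyph unless the same character was already kept within tolerance. */
    static String suppress(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {   // pairwise scan: cost grows with the glyph count
                if (k.ch == g.ch
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                text.append(g.ch);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "Hi" drawn twice with a small offset to fake bold.
        List<Glyph> fakeBold = Arrays.asList(
                new Glyph('H', 10f, 10f), new Glyph('i', 18f, 10f),
                new Glyph('H', 10.3f, 10f), new Glyph('i', 18.3f, 10f));
        System.out.println(suppress(fakeBold, 1f));   // Hi

        // Two genuine, tightly spaced 'l' glyphs: the second one is lost.
        List<Glyph> narrow = Arrays.asList(
                new Glyph('l', 10f, 10f), new Glyph('l', 10.8f, 10f));
        System.out.println(suppress(narrow, 1f));     // l
    }
}
```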
    • getExtractAnnotationText

      public boolean getExtractAnnotationText()
    • setExtractAnnotationText

      public void setExtractAnnotationText(boolean extractAnnotationText)
      If true (the default), text in annotations will be extracted.
    • getSortByPosition

      public boolean getSortByPosition()
    • setSortByPosition

      public void setSortByPosition(boolean sortByPosition)
      If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
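The two-column interleaving mentioned above can be reproduced with a small sketch. The Token type and join method are hypothetical, and the sort key (top-to-bottom, then left-to-right) is an assumption about what "sort by x/y position" means here, not a description of PDFBox internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of position-based ordering; not Tika or PDFBox code. */
final class SortByPositionSketch {

    /** A text token with its left edge and its distance from the top of the page. */
    static final class Token {
        final String text;
        final float x;
        final float yFromTop;

        Token(String text, float x, float yFromTop) {
            this.text = text;
            this.x = x;
            this.yFromTop = yFromTop;
        }
    }

    /** Joins tokens in content-stream order, or sorted by row and then column. */
    static String join(List<Token> contentOrder, boolean sortByPosition) {
        List<Token> ordered = new ArrayList<>(contentOrder);
        if (sortByPosition) {
            ordered.sort(Comparator
                    .comparingDouble((Token t) -> t.yFromTop)
                    .thenComparingDouble(t -> t.x));
        }
        return ordered.stream().map(t -> t.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Two columns, written column by column (the correct reading order).
        List<Token> twoColumns = Arrays.asList(
                new Token("A1", 50f, 100f), new Token("A2", 50f, 120f),
                new Token("B1", 300f, 100f), new Token("B2", 300f, 120f));
        System.out.println(join(twoColumns, false)); // A1 A2 B1 B2
        System.out.println(join(twoColumns, true));  // A1 B1 A2 B2 (interleaved)

        // One column whose lines were written out of order: sorting helps here.
        List<Token> shuffled = Arrays.asList(
                new Token("second", 50f, 120f), new Token("first", 50f, 100f));
        System.out.println(join(shuffled, true));    // first second
    }
}
```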
    • getAverageCharTolerance

      public Float getAverageCharTolerance()
    • setAverageCharTolerance

      public void setAverageCharTolerance(Float averageCharTolerance)
      See PDFTextStripper.setAverageCharTolerance(float)
    • getSpacingTolerance

      public Float getSpacingTolerance()
    • setSpacingTolerance

      public void setSpacingTolerance(Float spacingTolerance)
      See PDFTextStripper.setSpacingTolerance(float)
    • getDropThreshold

      public Float getDropThreshold()
    • setDropThreshold

      public void setDropThreshold(float dropThreshold)
    • getAccessChecker

      public AccessChecker getAccessChecker()
    • setAccessChecker

      public void setAccessChecker(AccessChecker accessChecker)
    • isCatchIntermediateIOExceptions

      public boolean isCatchIntermediateIOExceptions()
      Returns:
      whether or not to catch IOExceptions
    • getCatchIntermediateIOExceptions

      public boolean getCatchIntermediateIOExceptions()
      Returns:
      whether or not to catch IOExceptions
    • setCatchIntermediateIOExceptions

      public void setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
      The PDFBox parser will throw an IOException if there is a problem with a stream. If this is set to true, Tika's PDFParser will catch these exceptions and try to parse the rest of the document. After the parse is completed, Tika's PDFParser will throw the first caught exception.
      Parameters:
      catchIntermediateIOExceptions -
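The "keep going, then rethrow the first failure" pattern described above can be sketched independently of Tika. PageReader and parseAll are hypothetical names introduced for this example; the real parser works on PDFBox pages and streams.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of deferred exception handling; not Tika code. */
final class DeferredIOExceptionSketch {

    /** Stand-in for reading one page, which may fail. */
    interface PageReader {
        String read(int page) throws IOException;
    }

    /**
     * Appends each readable page to sink. With catchIntermediate set, failures
     * are remembered and the first one is rethrown after the last page.
     */
    static void parseAll(int pageCount, PageReader reader,
                         boolean catchIntermediate, List<String> sink) throws IOException {
        IOException first = null;
        for (int page = 1; page <= pageCount; page++) {
            try {
                sink.add(reader.read(page));
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e;        // fail fast: later pages are never read
                }
                if (first == null) {
                    first = e;      // remember only the first failure
                }
            }
        }
        if (first != null) {
            throw first;            // surfaced after all pages were attempted
        }
    }

    public static void main(String[] args) {
        PageReader flaky = page -> {
            if (page == 2) {
                throw new IOException("page 2 is corrupt");
            }
            return "text of page " + page;
        };
        List<String> sink = new ArrayList<>();
        try {
            parseAll(3, flaky, true, sink);
        } catch (IOException e) {
            System.out.println("rethrown at end: " + e.getMessage());
        }
        System.out.println(sink); // pages 1 and 3 were still collected
    }
}
```

Callers that want the partial output therefore need to consume it before, or in spite of, the exception thrown at the end.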
    • setOcrStrategy

      public void setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)
      Which strategy to use for OCR
      Parameters:
      ocrStrategy -
    • setOcrStrategy

      public void setOcrStrategy(String ocrStrategyString)
      Which strategy to use for OCR
      Parameters:
      ocrStrategyString -
    • getOcrStrategy

      public PDFParserConfig.OCR_STRATEGY getOcrStrategy()
      Returns:
      strategy to use for OCR
    • getOcrImageFormatName

      public String getOcrImageFormatName()
      String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg).
      Returns:
      the image format name
    • setOcrImageFormatName

      public void setOcrImageFormatName(String ocrImageFormatName)
      Parameters:
      ocrImageFormatName - name of image format used to render page image
    • getOcrImageType

      public org.apache.pdfbox.rendering.ImageType getOcrImageType()
      Image type used to render the page image for OCR.
      Returns:
      image type
    • setOcrImageType

      public void setOcrImageType(org.apache.pdfbox.rendering.ImageType ocrImageType)
      Image type used to render the page image for OCR.
      Parameters:
      ocrImageType -
    • setOcrImageType

      public void setOcrImageType(String ocrImageTypeString)
      Image type used to render the page image for OCR.
    • getOcrDPI

      public int getOcrDPI()
      Dots per inch used to render the page image for OCR.
      Returns:
      dots per inch
    • setOcrDPI

      public void setOcrDPI(int ocrDPI)
      Dots per inch used to render the page image for OCR. This does not apply to all image formats.
      Parameters:
      ocrDPI -
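As a rough guide to what a DPI value costs, the rendered image size can be estimated from the page size. The sketch below assumes the PDF default of 72 user-space units (points) per inch and ignores page rotation and UserUnit scaling; OcrRenderSizeSketch and toPixels are hypothetical names, and the actual rendering is done by PDFBox.

```java
/** Hypothetical helper for estimating rendered page size; not Tika code. */
final class OcrRenderSizeSketch {

    /** Default PDF user space: 72 points per inch. */
    static final float POINTS_PER_INCH = 72f;

    /** Converts a page dimension in points to pixels at the given DPI. */
    static int toPixels(float lengthInPoints, int dpi) {
        return Math.round(lengthInPoints * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        int dpi = 300;
        int width = toPixels(612f, dpi);   // US Letter width: 8.5 in
        int height = toPixels(792f, dpi);  // US Letter height: 11 in
        System.out.println(width + " x " + height);   // 2550 x 3300
        System.out.println((long) width * height);    // 8415000 pixels
    }
}
```

Doubling the DPI quadruples the pixel count, so memory and OCR time grow quickly with this setting.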
    • getOcrImageQuality

      public float getOcrImageQuality()
      Image quality used to render the page image for OCR. This does not apply to all image formats.
      Returns:
      the image quality
    • setOcrImageQuality

      public void setOcrImageQuality(float ocrImageQuality)
      Image quality used to render the page image for OCR. This does not apply to all image formats.
    • getOcrImageScale

      public float getOcrImageScale()
      Deprecated.
      as of Tika 1.23, this is no longer used in rendering page images; use setOcrDPI(int)
      Scale to use if rendering a page and then running OCR on that rendered image. Default is 2.0f.
    • setOcrImageScale

      public void setOcrImageScale(float ocrImageScale)
      Deprecated.
      (as of Tika 1.23, this is no longer used in rendering page images)
      Parameters:
      ocrImageScale -
    • setExtractActions

      public void setExtractActions(boolean v)
      Whether or not to extract PDActions from the file. Most Action types are handled inline; JavaScript macros are processed as embedded documents.
      Parameters:
      v -
    • getExtractActions

      public boolean getExtractActions()
      Returns:
      whether or not to extract PDActions
    • getMaxMainMemoryBytes

      public long getMaxMainMemoryBytes()
      The maximum amount of memory to use when loading a PDF into a PDDocument. Additional buffering is done using a temp file.
      Returns:
      the maximum number of bytes of main memory to use
    • setMaxMainMemoryBytes

      @Deprecated public void setMaxMainMemoryBytes(int maxMainMemoryBytes)
      Parameters:
      maxMainMemoryBytes -
    • setMaxMainMemoryBytes

      public void setMaxMainMemoryBytes(long maxMainMemoryBytes)
    • setSetKCMS

      public void setSetKCMS(boolean setKCMS)

      Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"). KCMS is the unmaintained, legacy provider and is far faster than the newer replacement. However, there are stability and security risks with using the unmaintained legacy provider.

      Note that this is not thread safe, because system properties are shared across the whole JVM. If the value is false in your first thread and a second thread changes it to true, the first thread will see true as well.

      Default is false.

      Parameters:
      setKCMS - whether or not to set KCMS
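The thread-safety caveat follows from system properties being JVM-wide. The sketch below demonstrates this with a made-up key, demo.sketch.cmm, chosen so that the example does not touch the real sun.java2d.cmm property; the class and method names are likewise hypothetical.

```java
/** Hypothetical demonstration that system properties are shared by all threads. */
final class SharedSystemPropertySketch {

    /** Made-up key for the demo; deliberately not sun.java2d.cmm. */
    static final String KEY = "demo.sketch.cmm";

    /** Lets another thread set the property, then reads it from the caller's thread. */
    static String readAfterOtherThreadWrites() {
        System.clearProperty(KEY);
        Thread writer = new Thread(() -> System.setProperty(KEY, "legacy"));
        writer.start();
        try {
            writer.join();   // wait until the writer thread has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.getProperty(KEY);
    }

    public static void main(String[] args) {
        // The calling thread never set the property, yet it sees the new value.
        System.out.println(readAfterOtherThreadWrites()); // legacy
    }
}
```

In practice this means the setting is best decided once at startup rather than per parse or per thread.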
    • getSetKCMS

      public boolean getSetKCMS()
    • setDetectAngles

      public void setDetectAngles(boolean detectAngles)
    • getDetectAngles

      public boolean getDetectAngles()
    • equals

      public boolean equals(Object o)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object