Class TesseractOCRConfig

java.lang.Object
org.apache.tika.parser.ocr.TesseractOCRConfig
All Implemented Interfaces:
Serializable

public class TesseractOCRConfig extends Object implements Serializable
Configuration for TesseractOCRParser.

This class lets you enable TesseractOCRParser and set its parameters:

TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath(tesseractFolder);
parseContext.set(TesseractOCRConfig.class, config);

Parameters can also be set either by editing the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or by overriding it with your own copy placed in the package org/apache/tika/parser/ocr on the classpath.

  • Constructor Details

    • TesseractOCRConfig

      public TesseractOCRConfig()
      Default constructor.
    • TesseractOCRConfig

      public TesseractOCRConfig(InputStream is)
      Loads properties from InputStream and then tries to close InputStream. If there is an IOException, this silently swallows the exception and goes back to the default.
      Parameters:
      is - the input stream to read the properties from
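The load-then-fall-back behaviour described for this constructor can be sketched with java.util.Properties alone. This is not Tika code: the class ConfigLoaderSketch, the method loadOrDefaults, and the default values used in main are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in; this is not the TesseractOCRConfig implementation.
final class ConfigLoaderSketch {

    // Reads key=value pairs from the stream on top of the supplied defaults.
    // If reading fails, the IOException is swallowed and only the defaults are used.
    static Properties loadOrDefaults(InputStream is, Properties defaults) {
        Properties props = new Properties(defaults);
        try (InputStream in = is) {   // try-with-resources closes the stream
            props.load(in);
        } catch (IOException e) {
            // Deliberately ignored, mirroring the "silently swallows" wording above.
            return new Properties(defaults);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("language", "eng");
        defaults.setProperty("timeout", "120");

        InputStream is = new ByteArrayInputStream(
                "language=fra\n".getBytes(StandardCharsets.ISO_8859_1));
        Properties loaded = loadOrDefaults(is, defaults);
        System.out.println(loaded.getProperty("language")); // fra
        System.out.println(loaded.getProperty("timeout"));  // 120
    }
}
```

Because a failed load is not reported, a caller who needs to know whether the stream was actually read has to check the resulting values explicitly.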
  • Method Details

    • getTesseractPath

      public String getTesseractPath()
    • setTesseractPath

      public void setTesseractPath(String tesseractPath)
      Set the path to the Tesseract executable's directory. This is needed if the executable is not on the system path.

      Note that if you set this value, it is highly recommended that you also set the path to the 'tessdata' folder using setTessdataPath(java.lang.String).

    • getTessdataPath

      public String getTessdataPath()
    • setTessdataPath

      public void setTessdataPath(String tessdataPath)
      Set the path to the 'tessdata' folder, which contains language files and config files. In some cases (such as on Windows), this folder is found in the Tesseract installation, but in other cases (such as when Tesseract is built from source), it may be located elsewhere.
    • getLanguage

      public String getLanguage()
    • setLanguage

      public void setLanguage(String language)
      Set the Tesseract language dictionary to be used. Default is "eng". Multiple languages may be specified, separated by plus characters, e.g. "chi_tra+chi_sim".
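The plus-separated format is plain string joining, as the following sketch shows. LanguageSpecSketch and join are hypothetical helpers, not part of this class, and falling back to "eng" for an empty list is an assumption that simply mirrors the documented default.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for building the value passed to setLanguage(String).
final class LanguageSpecSketch {

    // Joins language codes with '+'; an empty list yields the documented default "eng".
    static String join(List<String> codes) {
        if (codes.isEmpty()) {
            return "eng";
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("chi_tra", "chi_sim"))); // chi_tra+chi_sim
    }
}
```

The result would then be passed on as config.setLanguage(LanguageSpecSketch.join(codes)).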
    • getPageSegMode

      public String getPageSegMode()
    • setPageSegMode

      public void setPageSegMode(String pageSegMode)
      Set the Tesseract page segmentation mode. Default is 1 (automatic page segmentation with OSD, i.e. Orientation and Script Detection).
    • getPageSeparator

      public String getPageSeparator()
    • setPageSeparator

      public void setPageSeparator(String pageSeparator)
      The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option. The default here is the empty string (i.e. no page separators). Note that this is also the default in Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character. We are overriding Tesseract 4.0's default here.
      Parameters:
      pageSeparator - the page separator string
    • setTrustedPageSeparator

      public void setTrustedPageSeparator(String pageSeparator)
      Same as setPageSeparator(String) but does not perform any checks on the string.
      Parameters:
      pageSeparator - the page separator string, used as given without checks
    • setPreserveInterwordSpacing

      public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
      Sets whether to preserve interword spacing. Default is false.
      Parameters:
      preserveInterwordSpacing - true to preserve interword spacing, false otherwise
    • getPreserveInterwordSpacing

      public boolean getPreserveInterwordSpacing()
      Returns:
      whether or not to maintain interword spacing.
    • getMinFileSizeToOcr

      public long getMinFileSizeToOcr()
    • setMinFileSizeToOcr

      public void setMinFileSizeToOcr(long minFileSizeToOcr)
      Set the minimum size a file must have to be submitted to OCR. Default is 0.
    • getMaxFileSizeToOcr

      public long getMaxFileSizeToOcr()
    • setMaxFileSizeToOcr

      public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
      Set the maximum size a file may have to be submitted to OCR. Default is Integer.MAX_VALUE.
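Taken together, the minimum and maximum define a size window outside which a file is not sent to OCR. A minimal sketch of such a gate follows. OcrSizeGateSketch is hypothetical, and treating both bounds as inclusive is an assumption, since this page does not say how the boundary values are handled.

```java
// Hypothetical size gate; not the check TesseractOCRParser actually performs.
final class OcrSizeGateSketch {
    private final long minFileSizeToOcr;
    private final long maxFileSizeToOcr;

    OcrSizeGateSketch(long minFileSizeToOcr, long maxFileSizeToOcr) {
        this.minFileSizeToOcr = minFileSizeToOcr;
        this.maxFileSizeToOcr = maxFileSizeToOcr;
    }

    // Assumption: both bounds are inclusive.
    boolean shouldOcr(long fileSize) {
        return fileSize >= minFileSizeToOcr && fileSize <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        // Documented defaults: 0 and Integer.MAX_VALUE.
        OcrSizeGateSketch defaults = new OcrSizeGateSketch(0L, Integer.MAX_VALUE);
        System.out.println(defaults.shouldOcr(10_000L));                // true
        System.out.println(defaults.shouldOcr(Integer.MAX_VALUE + 1L)); // false
    }
}
```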
    • setTimeout

      public void setTimeout(int timeout)
      Set the maximum time, in seconds, to wait for the OCR process to terminate. Default is 120 seconds.
    • getTimeout

      public int getTimeout()
      Returns:
      timeout value for Tesseract
    • setOutputType

      public void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
      Set the output type of the OCR process. Valid values are "txt" and "hocr". Default value is TesseractOCRConfig.OUTPUT_TYPE.TXT.
    • setOutputType

      public void setOutputType(String outputType)
    • getOutputType

      public TesseractOCRConfig.OUTPUT_TYPE getOutputType()
    • isEnableImageProcessing

      public int isEnableImageProcessing()
      Returns:
      whether image processing is enabled, as an int flag
    • setEnableImageProcessing

      public void setEnableImageProcessing(int enableImageProcessing)
      Enable or disable image processing. Although described in boolean terms, the parameter is an int flag: pass 1 to enable processing and 0 to disable it. Default is 0 (disabled).
    • getDensity

      public int getDensity()
      Returns:
      the density
    • setDensity

      public void setDensity(int density)
      Parameters:
      density - the density to set. Valid range of values is 150-1200. Default value is 300.
    • getDepth

      public int getDepth()
      Returns:
      the depth
    • setDepth

      public void setDepth(int depth)
      Parameters:
      depth - the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.
    • getColorspace

      public String getColorspace()
      Returns:
      the colorspace
    • setColorspace

      public void setColorspace(String colorspace)
      Parameters:
      colorspace - the colorspace to set. Default value is gray.
    • getFilter

      public String getFilter()
      Returns:
      the filter
    • setFilter

      public void setFilter(String filter)
      Parameters:
      filter - the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.
    • getResize

      public int getResize()
      Returns:
      the resize
    • setResize

      public void setResize(int resize)
      Parameters:
      resize - the resize to set. Valid range of values is 100-900. Default value is 900.
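The valid values listed for density, depth, filter, and resize can be checked before they are passed to the setters. The following is a minimal sketch. ImageParamCheckSketch and its methods are hypothetical, and two points are assumptions: that the range endpoints are inclusive, and that filter names are matched exactly as written. This page does not say how the setters themselves react to invalid values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-checks based on the value lists documented above.
final class ImageParamCheckSketch {
    private static final Set<Integer> DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));
    private static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
            "point", "hermite", "cubic", "box", "gaussian",
            "catrom", "triangle", "quadratic", "mitchell"));

    // Documented range 150-1200; endpoints assumed inclusive.
    static boolean isValidDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    static boolean isValidDepth(int depth) {
        return DEPTHS.contains(depth);
    }

    // Exact, case-sensitive match (an assumption).
    static boolean isValidFilter(String filter) {
        return filter != null && FILTERS.contains(filter);
    }

    // Documented range 100-900; endpoints assumed inclusive.
    static boolean isValidResize(int resize) {
        return resize >= 100 && resize <= 900;
    }

    public static void main(String[] args) {
        // The documented defaults all pass the checks.
        System.out.println(isValidDensity(300) && isValidDepth(4)
                && isValidFilter("triangle") && isValidResize(900)); // true
    }
}
```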
    • getImageMagickPath

      public String getImageMagickPath()
      Returns:
      path to ImageMagick executable directory.
    • setImageMagickPath

      public void setImageMagickPath(String imageMagickPath)
      Set the path to the ImageMagick executable directory. This is needed if ImageMagick is not on the system path.
      Parameters:
      imageMagickPath - path to the ImageMagick executable directory.
    • getApplyRotation

      public boolean getApplyRotation()
      Returns:
      Whether or not a rotation value should be calculated and passed to ImageMagick before performing OCR. (Requires that Python is installed).
    • setApplyRotation

      public void setApplyRotation(boolean applyRotation)
      Sets whether or not a rotation value should be calculated and passed to ImageMagick.
      Parameters:
      applyRotation - true to calculate and apply rotation, false to skip. Default is false; true requires Python to be installed.
    • getOtherTesseractConfig

      public Map<String,String> getOtherTesseractConfig()
    • addOtherTesseractConfig

      public void addOtherTesseractConfig(String key, String value)
      Add a key-value pair to pass to Tesseract using its -c command line option. To see the possible options, run tesseract --print-parameters. You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract.
      Parameters:
      key - the name of the Tesseract parameter
      value - the value to assign to that parameter
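The underscore rule and the -c option described for addOtherTesseractConfig can be sketched as follows. TesseractArgsSketch and its methods are hypothetical, and the way the arguments are assembled here (one "-c" followed by "key=value" per pair, in sorted key order) is an assumption for illustration, not a description of how TesseractOCRParser builds its command line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical illustration of passing extra parameters to Tesseract via -c.
final class TesseractArgsSketch {

    // Keeps only the properties whose key contains an underscore,
    // following the rule stated for TesseractOCRConfig.properties.
    static Map<String, String> passThrough(Properties props) {
        Map<String, String> out = new TreeMap<>(); // sorted for stable output
        for (String key : props.stringPropertyNames()) {
            if (key.contains("_")) {
                out.put(key, props.getProperty(key));
            }
        }
        return out;
    }

    // Turns each pair into the two command-line tokens "-c" and "key=value".
    static List<String> toArgs(Map<String, String> other) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : other.entrySet()) {
            args.add("-c");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("language", "eng");                  // no underscore: not passed through
        props.setProperty("preserve_interword_spaces", "1");   // passed through
        System.out.println(toArgs(passThrough(props)));
        // [-c, preserve_interword_spaces=1]
    }
}
```

In application code the same pair would be registered with config.addOtherTesseractConfig("preserve_interword_spaces", "1"); tesseract --print-parameters lists the names that are accepted.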