Package org.apache.tika.parser.ocr
Class TesseractOCRConfig
java.lang.Object
org.apache.tika.parser.ocr.TesseractOCRConfig
- All Implemented Interfaces:
Serializable
Configuration for TesseractOCRParser.
Use this class to enable TesseractOCRParser and set its parameters:
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath(tesseractFolder);
parseContext.set(TesseractOCRConfig.class, config);
Parameters can also be set either by editing the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or by overriding it with your own copy placed in the package org/apache/tika/parser/ocr on the classpath.
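As a rough sketch of the override route, the snippet below builds a small properties document in memory and reads it back through an InputStream, which is the kind of input the TesseractOCRConfig(InputStream) constructor accepts. The class OcrPropertiesSketch and the key names language and timeout are assumptions made for illustration; check the bundled TesseractOCRConfig.properties for the keys that are actually read. The code uses only java.util.Properties, so it runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Tika.
final class OcrPropertiesSketch {

    // Serializes the overrides in java.util.Properties format and returns a stream over the bytes.
    static InputStream asStream(Properties overrides) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            overrides.store(out, "overrides for TesseractOCRConfig.properties");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ByteArrayInputStream(out.toByteArray());
    }

    // Loads properties from the stream and closes it afterwards.
    static Properties read(InputStream is) {
        try (InputStream in = is) {
            Properties loaded = new Properties();
            loaded.load(in);
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties overrides = new Properties();
        overrides.setProperty("language", "eng"); // assumed key name
        overrides.setProperty("timeout", "120");  // assumed key name

        // With Tika available, the stream could instead be handed to
        // new TesseractOCRConfig(stream); here it is only read back for inspection.
        Properties loaded = read(asStream(overrides));
        System.out.println(loaded.getProperty("language"));
    }
}
```

Keeping the overrides in a Properties object first makes it easy to write the same content to a file under org/apache/tika/parser/ocr when a classpath override is preferred.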
-
Nested Class Summary
Nested Classes -
Constructor Summary
Constructors
Constructor | Description
TesseractOCRConfig() | Default constructor.
TesseractOCRConfig(InputStream is) | Loads properties from InputStream and then tries to close InputStream.
-
Method Summary
Modifier and Type | Method | Description
void | addOtherTesseractConfig(String key, String value) | Add a key-value pair to pass to Tesseract using its -c command line option.
boolean | getApplyRotation() |
int | getDensity() |
int | getDepth() |
long | getMaxFileSizeToOcr() |
long | getMinFileSizeToOcr() |
boolean | getPreserveInterwordSpacing() |
int | getResize() |
int | getTimeout() |
int | isEnableImageProcessing() |
void | setApplyRotation(boolean applyRotation) | Sets whether or not a rotation value should be calculated and passed to ImageMagick.
void | setColorspace(String colorspace) |
void | setDensity(int density) |
void | setDepth(int depth) |
void | setEnableImageProcessing(int enableImageProcessing) | Set the value to true if processing is to be enabled.
void | setFilter(String filter) |
void | setImageMagickPath(String imageMagickPath) | Set the path to the ImageMagick executable directory, needed if it is not on system path.
void | setLanguage(String language) | Set tesseract language dictionary to be used.
void | setMaxFileSizeToOcr(long maxFileSizeToOcr) | Set maximum file size to submit file to ocr.
void | setMinFileSizeToOcr(long minFileSizeToOcr) | Set minimum file size to submit file to ocr.
void | setOutputType(String outputType) |
void | setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType) | Set output type from ocr process.
void | setPageSegMode(String pageSegMode) | Set tesseract page segmentation mode.
void | setPageSeparator(String pageSeparator) | The page separator to use in plain text output.
void | setPreserveInterwordSpacing(boolean preserveInterwordSpacing) | Whether or not to maintain interword spacing.
void | setResize(int resize) |
void | setTessdataPath(String tessdataPath) | Set the path to the 'tessdata' folder, which contains language files and config files.
void | setTesseractPath(String tesseractPath) | Set the path to the Tesseract executable's directory, needed if it is not on system path.
void | setTimeout(int timeout) | Set maximum time (seconds) to wait for the ocring process to terminate.
void | setTrustedPageSeparator(String pageSeparator) | Same as setPageSeparator(String) but does not perform any checks on the string.
-
Constructor Details
-
TesseractOCRConfig
public TesseractOCRConfig() Default constructor. -
TesseractOCRConfig
Loads properties from InputStream and then tries to close InputStream. If there is an IOException, this silently swallows the exception and goes back to the default. - Parameters:
is - the InputStream to load properties from
-
-
Method Details
-
getTesseractPath
-
setTesseractPath
Set the path to the Tesseract executable's directory, needed if it is not on system path. Note that if you set this value, it is highly recommended that you also set the path to the 'tessdata' folder using
setTessdataPath(java.lang.String). -
getTessdataPath
-
setTessdataPath
Set the path to the 'tessdata' folder, which contains language files and config files. In some cases (such as on Windows), this folder is found in the Tesseract installation, but in other cases (such as when Tesseract is built from source), it may be located elsewhere. -
getLanguage
-
setLanguage
Set tesseract language dictionary to be used. Default is "eng". Multiple languages may be specified, separated by plus characters. e.g. "chi_tra+chi_sim" -
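The plus-separated form described above can be assembled from a list of language codes. The helper below is a minimal sketch; the class name OcrLanguages and its join method are hypothetical and not part of Tika. The resulting string is what one would pass to setLanguage(String).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
final class OcrLanguages {

    // Joins Tesseract language codes with '+', the separator described for setLanguage.
    static String join(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        // prints chi_tra+chi_sim
        System.out.println(join(Arrays.asList("chi_tra", "chi_sim")));
    }
}
```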
getPageSegMode
-
setPageSegMode
Set tesseract page segmentation mode. Default is 1 = Automatic page segmentation with OSD (Orientation and Script Detection) -
getPageSeparator
-
setPageSeparator
The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option. The default here is the empty string (i.e. no page separators). Note that this is also the default in Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character. We are overriding Tesseract 4.0's default here. - Parameters:
pageSeparator-
-
setTrustedPageSeparator
Same as setPageSeparator(String) but does not perform any checks on the string. - Parameters:
pageSeparator-
-
setPreserveInterwordSpacing
public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing) Whether or not to maintain interword spacing. Default is false. - Parameters:
preserveInterwordSpacing-
-
getPreserveInterwordSpacing
public boolean getPreserveInterwordSpacing()- Returns:
- whether or not to maintain interword spacing.
-
getMinFileSizeToOcr
public long getMinFileSizeToOcr()
-
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr) Set minimum file size to submit file to ocr. Default is 0. -
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr()
-
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr) Set maximum file size to submit file to ocr. Default is Integer.MAX_VALUE. -
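The two size limits together define a window of file sizes that are submitted to OCR. The sketch below illustrates that idea with the documented defaults (0 and Integer.MAX_VALUE). The class OcrSizeWindow is hypothetical, and treating both bounds as inclusive is an assumption, since the text above does not say how the parser compares sizes.

```java
// Hypothetical helper for illustration; not part of Tika.
final class OcrSizeWindow {

    static final long DEFAULT_MIN = 0L;                // documented default minimum
    static final long DEFAULT_MAX = Integer.MAX_VALUE; // documented default maximum

    // Returns true if a file of the given size falls inside the window.
    // Inclusive bounds are an assumption made for this sketch.
    static boolean withinWindow(long sizeInBytes, long min, long max) {
        return sizeInBytes >= min && sizeInBytes <= max;
    }

    public static void main(String[] args) {
        long tenKiB = 10L * 1024;
        System.out.println(withinWindow(tenKiB, DEFAULT_MIN, DEFAULT_MAX));     // prints true
        System.out.println(withinWindow(tenKiB, 50L * 1024, DEFAULT_MAX));      // prints false
    }
}
```

Raising the minimum is one way to skip tiny images such as icons, and lowering the maximum keeps very large scans from tying up the OCR process.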
setTimeout
public void setTimeout(int timeout) Set the maximum time, in seconds, to wait for the OCR process to terminate. Default value is 120 seconds.
getTimeout
public int getTimeout()- Returns:
- timeout value for Tesseract
-
setOutputType
Set the output type of the OCR process. Valid values are "txt" and "hocr". Default value is TesseractOCRConfig.OUTPUT_TYPE.TXT. -
setOutputType
-
getOutputType
-
isEnableImageProcessing
public int isEnableImageProcessing()- Returns:
- image processing is enabled or not
-
setEnableImageProcessing
public void setEnableImageProcessing(int enableImageProcessing) Set the value to 1 (true) if image processing is to be enabled. Default value is 0 (false). -
getDensity
public int getDensity()- Returns:
- the density
-
setDensity
public void setDensity(int density) - Parameters:
density- the density to set. Valid range of values is 150-1200. Default value is 300.
-
getDepth
public int getDepth()- Returns:
- the depth
-
setDepth
public void setDepth(int depth) - Parameters:
depth- the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.
-
getColorspace
- Returns:
- the colorspace
-
setColorspace
- Parameters:
colorspace - the colorspace to set. Default value is gray.
-
getFilter
- Returns:
- the filter
-
setFilter
- Parameters:
filter- the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.
-
getResize
public int getResize()- Returns:
- the resize
-
setResize
public void setResize(int resize) - Parameters:
resize- the resize to set. Valid range of values is 100-900. Default value is 900.
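The valid values listed for density, depth, filter and resize can be checked before the corresponding setters are called. The class below is a minimal sketch of such a pre-check; OcrImageParamCheck and its methods are hypothetical, and the text above does not say how the setters themselves react to out-of-range values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-check for illustration; not part of Tika.
final class OcrImageParamCheck {

    // Depth values listed for setDepth.
    private static final Set<Integer> DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));

    // Filter names listed for setFilter (matched exactly, in lower case).
    private static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
            "point", "hermite", "cubic", "box", "gaussian",
            "catrom", "triangle", "quadratic", "mitchell"));

    static boolean validDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    static boolean validDepth(int depth) {
        return DEPTHS.contains(depth);
    }

    static boolean validFilter(String filter) {
        return filter != null && FILTERS.contains(filter);
    }

    static boolean validResize(int resize) {
        return resize >= 100 && resize <= 900;
    }

    public static void main(String[] args) {
        // The documented defaults (300, 4, triangle, 900) all pass the checks.
        System.out.println(validDensity(300) && validDepth(4)
                && validFilter("triangle") && validResize(900)); // prints true
    }
}
```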
-
getImageMagickPath
- Returns:
- path to ImageMagick executable directory.
-
setImageMagickPath
Set the path to the ImageMagick executable directory, needed if it is not on system path.- Parameters:
imageMagickPath - path to the ImageMagick executable directory.
-
getApplyRotation
public boolean getApplyRotation()- Returns:
- Whether or not a rotation value should be calculated and passed to ImageMagick before performing OCR. (Requires that Python is installed).
-
setApplyRotation
public void setApplyRotation(boolean applyRotation) Sets whether or not a rotation value should be calculated and passed to ImageMagick.- Parameters:
applyRotation - true to calculate and apply rotation, false to skip. Default is false; true requires that Python is installed.
-
getOtherTesseractConfig
-
addOtherTesseractConfig
Add a key-value pair to pass to Tesseract using its -c command line option. To see the possible options, run tesseract --print-parameters. You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract. - Parameters:
key -
value -
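To make the -c mapping concrete, the sketch below turns a map of key-value pairs into the "-c key=value" argument form that the tesseract command line accepts. The class TesseractConfigArgs is hypothetical and is not how Tika necessarily assembles its command line; the two parameter names are examples of Tesseract parameters, and tesseract --print-parameters remains the authority for what your installation supports.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika.
final class TesseractConfigArgs {

    // Emits "-c" followed by "key=value" for each entry, in map iteration order.
    static List<String> toArgs(Map<String, String> otherConfig) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : otherConfig.entrySet()) {
            args.add("-c");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> otherConfig = new LinkedHashMap<>();
        otherConfig.put("preserve_interword_spaces", "1");
        otherConfig.put("tessedit_char_whitelist", "0123456789");

        // prints [-c, preserve_interword_spaces=1, -c, tessedit_char_whitelist=0123456789]
        System.out.println(toArgs(otherConfig));
    }
}
```

Both example keys contain an underscore, so under the rule described above they would also be forwarded to Tesseract if placed in TesseractOCRConfig.properties.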
-