Package org.apache.tika.language
Class LanguageProfilerBuilder
java.lang.Object
org.apache.tika.language.LanguageProfilerBuilder
Deprecated.
This class runs an ngram analysis over submitted text; the results can be used
for automatic language identification.
The similarity calculation is experimental. You have been warned.
Methods are provided to build new ngram profiles.
-
Constructor Summary
Constructors
Constructor | Description
LanguageProfilerBuilder(String name) | Deprecated. Constructs a new ngram profile where minlen=3, maxlen=3
LanguageProfilerBuilder(String name, int minlen, int maxlen) | Deprecated. Constructs a new ngram profile
-
Method Summary
Modifier and Type | Method | Description
void | add(StringBuffer word) | Deprecated. Adds ngrams from a single word to this profile
void | analyze(StringBuilder text) | Deprecated. Analyzes a piece of text
static LanguageProfilerBuilder | create(String name, InputStream is, String encoding) | Deprecated. Creates a new language profile from a text file (preferably quite large, 5-10k lines)
String | getName() | Deprecated.
float | getSimilarity(LanguageProfilerBuilder another) | Deprecated. Calculates a score for how well NGramProfiles match each other
List<org.apache.tika.language.LanguageProfilerBuilder.NGramEntry> | getSorted() | Deprecated. Returns a sorted list of ngrams (sorted by 1. frequency, 2. sequence)
void | load(InputStream is) | Deprecated. Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
static void | main(String[] args) | Deprecated. main method used for testing only
void | save(OutputStream os) | Deprecated. Writes NGramProfile content into an OutputStream, encoded as UTF-8
String | toString() | Deprecated.
-
Constructor Details
-
LanguageProfilerBuilder
Deprecated.
Constructs a new ngram profile
Parameters:
name - the name of the profile
minlen - the minimum length of ngram sequences
maxlen - the maximum length of ngram sequences
-
LanguageProfilerBuilder
Deprecated.
Constructs a new ngram profile where minlen=3, maxlen=3
Parameters:
name - the name of the profile, usually a two-character string
Since:
Tika 1.0
-
-
Method Details
-
getName
Deprecated.
Returns:
the name
-
add
Deprecated.
Adds ngrams from a single word to this profile
Parameters:
word - the word to add
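As a rough illustration of what adding a single word to an ngram profile involves, the sketch below slides a window of each length from minlen to maxlen across the word and counts every substring it sees. This is not the Tika implementation: the class WordNGramSketch, the method addWord, and the plain Map used as a profile are all hypothetical, and the sketch assumes no padding at word boundaries, which this page does not specify.

```java
import java.util.Map;

// Hypothetical sketch, not the Tika implementation.
final class WordNGramSketch {

    // Adds every ngram of length minlen..maxlen found in word to counts.
    // Assumption: the word is used as-is, with no boundary padding.
    static void addWord(Map<String, Integer> counts, CharSequence word,
                        int minlen, int maxlen) {
        for (int len = minlen; len <= maxlen; len++) {
            // Slide a window of the current length across the word.
            for (int start = 0; start + len <= word.length(); start++) {
                String ngram = word.subSequence(start, start + len).toString();
                counts.merge(ngram, 1, Integer::sum);
            }
        }
    }
}
```

For the word "banana" with minlen=3 and maxlen=3, this counts "ban" once, "ana" twice, and "nan" once.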
-
analyze
Deprecated.
Analyzes a piece of text
Parameters:
text - the text to be analyzed
-
getSorted
Deprecated.
Returns a sorted list of ngrams (sorted by 1. frequency, 2. sequence)
Returns:
sorted list of ngrams
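The two-level ordering described above (frequency first, then sequence) can be sketched with a comparator over plain map entries. The class SortedNGramSketch and its sorted method are hypothetical, and the sketch assumes that higher frequencies come first and that ties are broken by ascending ngram text; this page does not state either direction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Tika implementation.
final class SortedNGramSketch {

    // Returns the entries ordered by count (highest first, an assumption),
    // then by ngram text (ascending, also an assumption).
    static List<Map.Entry<String, Integer>> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }
}
```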
-
toString
Deprecated.
-
getSimilarity
Deprecated.
Calculates a score for how well NGramProfiles match each other
Parameters:
another - ngram profile to compare against
Returns:
similarity score, where 0 means an exact match
Throws:
TikaException - if a score could not be calculated
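Because 0 means an exact match, the score behaves like a distance rather than a similarity. One way such a score could work is sketched below: each profile's counts are normalized to relative frequencies, and the absolute differences are summed and halved, giving 0 for identical profiles and about 1 for profiles with no ngrams in common. The formula, the class SimilaritySketch, and the method distance are assumptions made for illustration; this page does not say how the real calculation is done.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the Tika implementation.
final class SimilaritySketch {

    // Returns half the sum of absolute differences between the relative
    // ngram frequencies of the two profiles (0 = identical distributions).
    static float distance(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            throw new IllegalArgumentException("profiles must not be empty");
        }
        float totalA = total(a);
        float totalB = total(b);
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        float sum = 0f;
        for (String ngram : ngrams) {
            float fa = a.getOrDefault(ngram, 0) / totalA;
            float fb = b.getOrDefault(ngram, 0) / totalB;
            sum += Math.abs(fa - fb);
        }
        return sum / 2f;
    }

    // Sums all counts in a profile; counts are assumed to be positive.
    private static float total(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total;
    }
}
```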
-
load
Deprecated.
Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
Parameters:
is - the InputStream to read
Throws:
IOException
-
create
public static LanguageProfilerBuilder create(String name, InputStream is, String encoding) throws TikaException
Deprecated.
Creates a new language profile from a text file (preferably quite large, 5-10k lines)
Parameters:
name - the name to give the profile
is - the stream to read
encoding - the encoding of the stream
Throws:
TikaException - if a language profile could not be created
-
save
Deprecated.
Writes NGramProfile content into an OutputStream, encoded as UTF-8
Parameters:
os - the stream to write to
Throws:
IOException
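Both load and save fix the encoding to UTF-8 rather than using the platform default. The sketch below shows that pattern with explicit charsets on both the writer and the reader, so that ngrams containing non-ASCII characters survive a round trip. The class ProfileIoSketch and the one-entry-per-line "ngram count" layout are hypothetical; the file format actually used by this class is not described on this page.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the Tika implementation or its file format.
final class ProfileIoSketch {

    // Writes one "ngram count" line per entry, always encoded as UTF-8.
    static void save(Map<String, Integer> counts, OutputStream os) throws IOException {
        BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            writer.write(entry.getKey() + " " + entry.getValue());
            writer.write('\n');
        }
        writer.flush(); // flush, but leave closing the stream to the caller
    }

    // Reads the lines written by save, decoding the stream as UTF-8.
    static Map<String, Integer> load(InputStream is) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // skip lines without an "ngram count" separator
            }
            counts.put(line.substring(0, sep), Integer.parseInt(line.substring(sep + 1)));
        }
        return counts;
    }
}
```

Splitting on the last space lets an ngram that itself contains a space be read back unchanged.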
-
main
Deprecated.
main method used for testing only
Parameters:
args -
-