Class LanguageProfilerBuilder

java.lang.Object
org.apache.tika.language.LanguageProfilerBuilder

@Deprecated public class LanguageProfilerBuilder extends Object
Deprecated.
This class runs an ngram analysis over submitted text; the results can be used for automatic language identification. The similarity calculation is experimental, so treat its results with caution. Methods are provided to build new ngram profiles.
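As a rough illustration of what an ngram analysis involves, the sketch below counts character ngrams in a string using plain Java. It is not Tika's implementation: the class `NGramCountSketch` and its `count` method are hypothetical names, and any word-boundary padding or normalization that a real profile might apply is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of Tika: counts character ngrams in a string.
class NGramCountSketch {

    /** Counts every ngram of length minLen..maxLen (inclusive) found in text. */
    static Map<String, Integer> count(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            // Slide a window of width n across the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "banana" contains the 3-grams ban, ana, nan, ana.
        System.out.println(count("banana", 3, 3));
    }
}
```

With `minLen` and `maxLen` both set to 3, the sketch mirrors the default lengths named for the single-argument constructor below.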
  • Constructor Details

    • LanguageProfilerBuilder

      public LanguageProfilerBuilder(String name, int minlen, int maxlen)
      Deprecated.
      Constructs a new ngram profile
      Parameters:
      name - the name of the profile
      minlen - the minimum length of ngram sequences
      maxlen - the maximum length of ngram sequences
    • LanguageProfilerBuilder

      public LanguageProfilerBuilder(String name)
      Deprecated.
      Constructs a new ngram profile where minlen=3, maxlen=3
      Parameters:
      name - the name of the profile, usually a two-character string
      Since:
      Tika 1.0
  • Method Details

    • getName

      public String getName()
      Deprecated.
      Returns:
      the name of the profile
    • add

      public void add(StringBuffer word)
      Deprecated.
      Adds ngrams from a single word to this profile
      Parameters:
      word - is the word to add
    • analyze

      public void analyze(StringBuilder text)
      Deprecated.
      Analyzes a piece of text
      Parameters:
      text - the text to be analyzed
    • getSorted

      public List<org.apache.tika.language.LanguageProfilerBuilder.NGramEntry> getSorted()
      Deprecated.
      Returns a sorted list of ngrams, ordered first by frequency and then by sequence
      Returns:
      sorted list of ngrams
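The two-level ordering described for getSorted can be sketched in plain Java as follows. This is an illustration under assumptions, not Tika's code: `NGramSortSketch` is a hypothetical class, ngrams are represented as map entries rather than `NGramEntry` objects, and the choice of descending frequency (most frequent first) is an assumption the description above does not confirm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Tika: orders ngram counts by frequency, then by sequence.
class NGramSortSketch {

    static List<Map.Entry<String, Integer>> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Assumption: higher frequency sorts first.
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            // Ties are broken by the ngram text itself, in natural String order.
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("nan", 1);
        counts.put("ana", 2);
        counts.put("ban", 1);
        for (Map.Entry<String, Integer> e : sorted(counts)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```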
    • toString

      public String toString()
      Deprecated.
      Overrides:
      toString in class Object
    • getSimilarity

      public float getSimilarity(LanguageProfilerBuilder another) throws TikaException
      Deprecated.
      Calculates a score for how well two ngram profiles match each other
      Parameters:
      another - ngram profile to compare against
      Returns:
      the similarity score; 0 means an exact match
      Throws:
      TikaException - if a score could not be calculated
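Because 0 denotes an exact match, the returned value behaves like a distance: smaller means more similar. The sketch below shows one simple way such a score could be computed, by summing absolute differences of relative ngram frequencies. The formula is an assumption chosen for illustration and is not taken from Tika; `ProfileDistanceSketch` and `distance` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of Tika: a distance between two ngram frequency maps.
class ProfileDistanceSketch {

    /** Sums |a(k) - b(k)| over every ngram k seen in either profile; 0 means identical. */
    static float distance(Map<String, Float> a, Map<String, Float> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        float sum = 0f;
        for (String key : keys) {
            // An ngram missing from one profile counts as frequency 0 there.
            sum += Math.abs(a.getOrDefault(key, 0f) - b.getOrDefault(key, 0f));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> first = new HashMap<>();
        first.put("abc", 0.5f);
        first.put("bcd", 0.5f);
        Map<String, Float> second = new HashMap<>();
        second.put("xyz", 1.0f);
        System.out.println(distance(first, first));
        System.out.println(distance(first, second));
    }
}
```

When comparing a text against several candidate profiles with a score of this kind, the candidate with the lowest value would be the closest match.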
    • load

      public void load(InputStream is) throws IOException
      Deprecated.
      Loads an ngram profile from an InputStream (assumes UTF-8 encoded content)
      Parameters:
      is - the InputStream to read
      Throws:
      IOException
    • create

      public static LanguageProfilerBuilder create(String name, InputStream is, String encoding) throws TikaException
      Deprecated.
      Creates a new language profile from a text file, preferably a large one (5-10k lines)
      Parameters:
      name - to be given for the profile
      is - a stream to be read
      encoding - the encoding of the stream
      Throws:
      TikaException - if a language profile could not be created
    • save

      public void save(OutputStream os) throws IOException
      Deprecated.
      Writes the ngram profile content to an OutputStream, encoded as UTF-8
      Parameters:
      os - the OutputStream to write to
      Throws:
      IOException
    • main

      public static void main(String[] args)
      Deprecated.
      main method used for testing only
      Parameters:
      args -