Class FreeTextSuggester

java.lang.Object
org.apache.lucene.search.suggest.Lookup
org.apache.lucene.search.suggest.analyzing.FreeTextSuggester

public class FreeTextSuggester extends Lookup
Builds an ngram model from the text sent to build(org.apache.lucene.search.suggest.InputIterator) and predicts based on the last grams-1 tokens in the request sent to lookup(java.lang.CharSequence, boolean, int). This tries to handle the "long tail" of suggestions, for when the incoming query is a never-before-seen query string.
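
To illustrate "the last grams-1 tokens", here is a minimal sketch in plain Java. It assumes whitespace splitting as a stand-in for the query analyzer; the class and method names (ContextWindow, lastTokens) are hypothetical and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;

final class ContextWindow {
    // Returns at most (grams - 1) trailing tokens of the request.
    // Whitespace splitting is an assumption standing in for a real analyzer.
    static List<String> lastTokens(String request, int grams) {
        String trimmed = request.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        List<String> tokens = Arrays.asList(trimmed.split("\\s+"));
        int from = Math.max(0, tokens.size() - (grams - 1));
        return tokens.subList(from, tokens.size());
    }

    public static void main(String[] args) {
        // With a trigram model (grams = 3), two tokens of context are kept.
        System.out.println(lastTokens("the quick brown", 3));
        // With the default bigram model (grams = 2), one token is kept.
        System.out.println(lastTokens("the quick brown", 2));
    }
}
```

A request shorter than grams-1 tokens simply yields all of its tokens.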

Likely this suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.

Note that the weight for each suggestion is unused, and the suggestions are the analyzed forms (so your analysis process should normally be very "light").

This uses the stupid backoff language model to smooth scores across ngram models; see "Large language models in machine translation", http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1126 for details.
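
The idea behind stupid backoff can be sketched with plain in-memory counts. This is an illustration of the general technique only, not the suggester's actual implementation: the class StupidBackoffSketch, its count map, and the space-joined keys are all assumptions made for the example. The 0.4 factor matches the value given in the ALPHA field description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class StupidBackoffSketch {
    static final double ALPHA = 0.4; // backoff factor, as in the ALPHA description

    private final Map<String, Long> counts = new HashMap<>(); // ngram -> occurrences
    private final int grams;
    private long totalTokens = 0;

    StupidBackoffSketch(int grams) {
        this.grams = grams;
    }

    // Counts every ngram of length 1..grams that starts at each position.
    void add(String[] tokens) {
        totalTokens += tokens.length;
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder ngram = new StringBuilder();
            for (int n = 0; n < grams && start + n < tokens.length; n++) {
                if (n > 0) {
                    ngram.append(' ');
                }
                ngram.append(tokens[start + n]);
                counts.merge(ngram.toString(), 1L, Long::sum);
            }
        }
    }

    // Scores word after context: relative frequency at the longest matching
    // order, multiplied by ALPHA once for each order we had to back off.
    double score(String[] context, String word) {
        double backoff = 1.0;
        int first = Math.max(0, context.length - (grams - 1));
        for (int skip = first; skip <= context.length; skip++) {
            String prefix = String.join(" ", Arrays.copyOfRange(context, skip, context.length));
            long denominator = prefix.isEmpty() ? totalTokens : counts.getOrDefault(prefix, 0L);
            String full = prefix.isEmpty() ? word : prefix + " " + word;
            long numerator = counts.getOrDefault(full, 0L);
            if (numerator > 0 && denominator > 0) {
                return backoff * numerator / denominator;
            }
            backoff *= ALPHA;
        }
        return 0.0; // word never seen at any order
    }

    public static void main(String[] args) {
        StupidBackoffSketch model = new StupidBackoffSketch(2);
        model.add(new String[] {"a", "b", "c"});
        model.add(new String[] {"a", "c"});
        System.out.println(model.score(new String[] {"a"}, "b")); // bigram seen
        System.out.println(model.score(new String[] {"b"}, "a")); // backs off to unigram
    }
}
```

Unlike a properly normalized language model, the backed-off values are scores rather than probabilities, which is why the scheme is cheap to compute.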

From lookup(java.lang.CharSequence, boolean, int), the key of each result is the ngram token; the value is Long.MAX_VALUE * score (fixed point, cast to long). Divide by Long.MAX_VALUE to get the score back, which ranges from 0.0 to 1.0. onlyMorePopular is unused.
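
The fixed-point conversion described above can be sketched as follows. The class ScoreCodec and its method names are hypothetical; only the arithmetic (multiply by Long.MAX_VALUE, cast to long, divide to recover) comes from the description.

```java
final class ScoreCodec {
    // Encodes a score in [0.0, 1.0] as Long.MAX_VALUE * score, cast to long.
    static long encode(double score) {
        return (long) (Long.MAX_VALUE * score);
    }

    // Recovers the score by dividing the stored value by Long.MAX_VALUE.
    static double decode(long value) {
        return (double) value / Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        long stored = encode(0.4);
        // The round trip is approximate because double has 53 bits of precision.
        System.out.println(stored + " -> " + decode(stored));
    }
}
```

In practice, decode would be applied to the value field of each Lookup.LookupResult returned by lookup.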

  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.search.suggest.Lookup

    Lookup.LookupPriorityQueue, Lookup.LookupResult
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final double
    ALPHA
    The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
    static final String
    CODEC_NAME
    Codec name used in the header for the saved model.
    static final int
    DEFAULT_GRAMS
    By default we use a bigram model.
    static final byte
    DEFAULT_SEPARATOR
    The default character used to join multiple tokens into a single ngram token.
    static final int
    VERSION_CURRENT
    Current version of the saved model file format.
    static final int
    VERSION_START
    Initial version of the saved model file format.

    Fields inherited from class org.apache.lucene.search.suggest.Lookup

    CHARSEQUENCE_COMPARATOR
  • Constructor Summary

    Constructors
    Constructor
    Description
    FreeTextSuggester(Analyzer analyzer)
    Instantiate, using the provided analyzer for both indexing and lookup, using bigram model by default.
    FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer)
    Instantiate, using the provided indexing and lookup analyzers, using bigram model by default.
    FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams)
    Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
    FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator)
    Instantiate, using the provided indexing and lookup analyzers, and specified model (2 = bigram, 3 = trigram, etc.).
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    build(InputIterator iterator)
    Builds up a new internal Lookup representation based on the given InputIterator.
    void
    build(InputIterator iterator, double ramBufferSizeMB)
    Build the suggest index, using up to the specified amount of temporary RAM while building.
    Object
    get(CharSequence key)
    Returns the weight associated with an input string, or null if it does not exist.
    long
    getCount()
    Get the number of entries the lookup was built with.
    boolean
    load(DataInput input)
    Discard current lookup data and load it from a previously saved copy.
    List<Lookup.LookupResult>
    lookup(CharSequence key, boolean onlyMorePopular, int num)
    Look up a key and return possible completions for this key.
    List<Lookup.LookupResult>
    lookup(CharSequence key, int num)
    Retrieve suggestions.
    long
    sizeInBytes()
    Returns byte size of the underlying FST.
    boolean
    store(DataOutput output)
    Persist the constructed lookup data to a directory.

    Methods inherited from class org.apache.lucene.search.suggest.Lookup

    build, load, store

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • CODEC_NAME

      public static final String CODEC_NAME
      Codec name used in the header for the saved model.
    • VERSION_START

      public static final int VERSION_START
      Initial version of the saved model file format.
    • VERSION_CURRENT

      public static final int VERSION_CURRENT
      Current version of the saved model file format.
    • DEFAULT_GRAMS

      public static final int DEFAULT_GRAMS
      By default we use a bigram model.
    • ALPHA

      public static final double ALPHA
      The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
    • DEFAULT_SEPARATOR

      public static final byte DEFAULT_SEPARATOR
      The default character used to join multiple tokens into a single ngram token. The input tokens produced by the analyzer must not contain this character.
  • Constructor Details

    • FreeTextSuggester

      public FreeTextSuggester(Analyzer analyzer)
      Instantiate, using the provided analyzer for both indexing and lookup, using bigram model by default.
    • FreeTextSuggester

      public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer)
      Instantiate, using the provided indexing and lookup analyzers, using bigram model by default.
    • FreeTextSuggester

      public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams)
      Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
    • FreeTextSuggester

      public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator)
      Instantiate, using the provided indexing and lookup analyzers, and specified model (2 = bigram, 3 = trigram, etc.). The separator is passed to ShingleFilter.setTokenSeparator(java.lang.String) to join multiple tokens into a single ngram token; it must be an ASCII (7-bit-clean) byte. Input tokens must not contain this byte; otherwise an IllegalArgumentException is thrown.
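
The two separator constraints can be checked up front along these lines. This is a sketch only: SeparatorCheck and its methods are hypothetical helpers, not Lucene API, and the 0x1e separator value used in main is an assumption chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class SeparatorCheck {
    // A 7-bit-clean byte has its high bit unset (values 0x00 to 0x7F).
    static boolean isSevenBitClean(byte separator) {
        return (separator & 0x80) == 0;
    }

    // In UTF-8, bytes below 0x80 only ever encode ASCII characters,
    // so a plain byte scan is enough to detect the separator.
    static boolean containsSeparator(String token, byte separator) {
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            if (b == separator) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the documented contract by rejecting invalid input.
    static void validate(String token, byte separator) {
        if (!isSevenBitClean(separator)) {
            throw new IllegalArgumentException("separator must be a 7-bit ASCII byte");
        }
        if (containsSeparator(token, separator)) {
            throw new IllegalArgumentException("token contains the separator byte");
        }
    }

    public static void main(String[] args) {
        byte separator = (byte) 0x1e; // assumed value, for illustration only
        validate("new york", separator);
        System.out.println("token accepted");
    }
}
```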
  • Method Details

    • sizeInBytes

      public long sizeInBytes()
      Returns byte size of the underlying FST.
      Specified by:
      sizeInBytes in class Lookup
      Returns:
      RAM size of the lookup implementation in bytes
    • build

      public void build(InputIterator iterator) throws IOException
      Description copied from class: Lookup
      Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
      Specified by:
      build in class Lookup
      Throws:
      IOException
    • build

      public void build(InputIterator iterator, double ramBufferSizeMB) throws IOException
      Build the suggest index, using up to the specified amount of temporary RAM while building. Note that the weights for the suggestions are ignored.
      Throws:
      IOException
    • store

      public boolean store(DataOutput output) throws IOException
      Description copied from class: Lookup
      Persist the constructed lookup data to a directory. Optional operation.
      Specified by:
      store in class Lookup
      Parameters:
      output - DataOutput to write the data to.
      Returns:
      true if successful, false if unsuccessful or not supported.
      Throws:
      IOException - when fatal IO error occurs.
    • load

      public boolean load(DataInput input) throws IOException
      Description copied from class: Lookup
      Discard current lookup data and load it from a previously saved copy. Optional operation.
      Specified by:
      load in class Lookup
      Parameters:
      input - the DataInput to load the lookup data.
      Returns:
      true if completed successfully, false if unsuccessful or not supported.
      Throws:
      IOException - when fatal IO error occurs.
    • lookup

      public List<Lookup.LookupResult> lookup(CharSequence key, boolean onlyMorePopular, int num)
      Description copied from class: Lookup
      Look up a key and return possible completions for this key.
      Specified by:
      lookup in class Lookup
      Parameters:
      key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
      onlyMorePopular - return only more popular results
      num - maximum number of results to return
      Returns:
      a list of possible completions, with their relative weight (e.g. popularity)
    • getCount

      public long getCount()
      Description copied from class: Lookup
      Get the number of entries the lookup was built with
      Specified by:
      getCount in class Lookup
      Returns:
      total number of suggester entries
    • lookup

      public List<Lookup.LookupResult> lookup(CharSequence key, int num) throws IOException
      Retrieve suggestions.
      Throws:
      IOException
    • get

      public Object get(CharSequence key)
      Returns the weight associated with an input string, or null if it does not exist.