Class ArabicLetterTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable

@Deprecated public class ArabicLetterTokenizer extends LetterTokenizer
Deprecated.
(3.1) Use StandardTokenizer instead.
Tokenizer that breaks text into runs of letters and diacritics.

The standard LetterTokenizer splits words at diacritics, because combining marks are not classified as letters. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.
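The effect of diacritics on a letter-only tokenizer can be illustrated with a minimal, self-contained sketch that uses only the JDK. The class `DiacriticTokenSketch` and its methods are hypothetical names invented for this illustration and are not part of Lucene. The sketch also assumes that "diacritics" means Unicode non-spacing marks (category Mn), which may not match the tokenizer's exact rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits text into runs of token characters.
final class DiacriticTokenSketch {

    // Letter-only predicate: roughly what a plain letter tokenizer accepts.
    static boolean isLetterOnly(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Assumption: diacritics are Unicode non-spacing marks (category Mn),
    // such as Arabic fatha (U+064E).
    static boolean isLetterOrMark(int codePoint) {
        return Character.isLetter(codePoint)
                || Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    // Collects maximal runs of accepted code points as tokens.
    static List<String> tokenize(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean tokenChar = keepMarks ? isLetterOrMark(cp) : isLetterOnly(cp);
            if (tokenChar) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // kaf, fatha, teh, fatha, beh: one vocalized Arabic word
        String word = "\u0643\u064E\u062A\u064E\u0628";
        System.out.println(tokenize(word, false).size()); // prints 3
        System.out.println(tokenize(word, true).size());  // prints 1
    }
}
```

With the letter-only predicate the single vocalized word breaks into three fragments at each fatha, while the letter-or-mark predicate keeps it as one token.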

You must specify the required Version compatibility when creating ArabicLetterTokenizer:

  • As of 3.1, CharTokenizer uses an int based API to normalize and detect token characters. See isTokenChar(int) and CharTokenizer.normalize(int) for details.
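Why an int based API matters can be shown with a short JDK-only sketch. The class `CodePointApiSketch` and its methods are hypothetical stand-ins written for this illustration; they are not the Lucene implementations of isTokenChar(int) or CharTokenizer.normalize(int), and the lowercasing normalizer is only an assumed example of a normalization step.

```java
// Hypothetical illustration, not Lucene code: contrasts an int (code point)
// based walk over text with a char based walk.
final class CodePointApiSketch {

    // Stand-in for an int-based token predicate.
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // Stand-in for an int-based normalizer (assumed here to lowercase).
    static int normalize(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // Walks the text one code point at a time and keeps normalized token characters.
    static String keepByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar(cp)) {
                out.appendCodePoint(normalize(cp));
            }
            i += Character.charCount(cp); // 2 for supplementary characters
        }
        return out.toString();
    }

    // Char-based walk for contrast: each surrogate half is tested on its own
    // and rejected, so supplementary letters are lost.
    static String keepByChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'A' followed by U+10400 (DESERET CAPITAL LETTER LONG I), a surrogate pair
        String text = "A\uD801\uDC00";
        System.out.println(keepByCodePoint(text).codePointCount(0, keepByCodePoint(text).length())); // prints 2
        System.out.println(keepByChar(text).length()); // prints 1
    }
}
```

The code point walk keeps and lowercases both letters, while the char walk drops the supplementary letter because neither surrogate half is a letter by itself.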

Constructor Details

    • ArabicLetterTokenizer

      public ArabicLetterTokenizer(Version matchVersion, Reader in)
      Deprecated.
      Construct a new ArabicLetterTokenizer.
      Parameters:
      matchVersion - Lucene version to match. See above.
      in - the input to split up into tokens
    • ArabicLetterTokenizer

      public ArabicLetterTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader in)
      Deprecated.
      Construct a new ArabicLetterTokenizer using a given AttributeSource.AttributeFactory.
      Parameters:
      matchVersion - Lucene version to match. See above.
      factory - the attribute factory to use for this Tokenizer
      in - the input to split up into tokens