Package org.apache.lucene.analysis.miscellaneous
Miscellaneous TokenStreams
Classes

Class: Description
ASCIIFoldingFilter: This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
ASCIIFoldingFilterFactory: Factory for ASCIIFoldingFilter.
CapitalizationFilter: A filter to apply normal capitalization rules to Tokens.
CapitalizationFilterFactory: Factory for CapitalizationFilter.
CodepointCountFilter: Removes words that are too long or too short from the stream.
CodepointCountFilterFactory: Factory for CodepointCountFilter.
EmptyTokenStream: An always exhausted token stream.
HyphenatedWordsFilter: When plain text is extracted from documents, many words are often hyphenated and broken across two lines.
HyphenatedWordsFilterFactory: Factory for HyphenatedWordsFilter.
KeepWordFilter: A TokenFilter that only keeps tokens with text contained in the required words.
KeepWordFilterFactory: Factory for KeepWordFilter.
KeywordMarkerFilter: Marks terms as keywords via the KeywordAttribute.
KeywordMarkerFilterFactory: Factory for KeywordMarkerFilter.
KeywordRepeatFilter: This TokenFilter emits each incoming token twice, once as keyword and once as non-keyword; in other words, once with KeywordAttribute.setKeyword(boolean) set to true and once set to false.
KeywordRepeatFilterFactory: Factory for KeywordRepeatFilter.
LengthFilter: Removes words that are too long or too short from the stream.
LengthFilterFactory: Factory for LengthFilter.
LimitTokenCountAnalyzer: This Analyzer limits the number of tokens while indexing.
LimitTokenCountFilter: This TokenFilter limits the number of tokens while indexing.
LimitTokenCountFilterFactory: Factory for LimitTokenCountFilter.
LimitTokenPositionFilter: This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.
LimitTokenPositionFilterFactory: Factory for LimitTokenPositionFilter.
PatternAnalyzer: Deprecated. (4.0) Use the pattern-based analysis in the analysis/pattern package instead.
PatternKeywordMarkerFilter: Marks terms as keywords via the KeywordAttribute.
PerFieldAnalyzerWrapper: This analyzer is used to facilitate scenarios where different fields require different analysis techniques.
PrefixAndSuffixAwareTokenFilter: Links two PrefixAwareTokenFilter instances.
PrefixAwareTokenFilter: Joins two token streams and leaves the last token of the first stream available to be used when updating the token values in the second stream based on that token.
RemoveDuplicatesTokenFilter: A TokenFilter which filters out Tokens at the same position and Term text as the previous token in the stream.
RemoveDuplicatesTokenFilterFactory: Factory for RemoveDuplicatesTokenFilter.
ScandinavianFoldingFilter: This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.
ScandinavianFoldingFilterFactory: Factory for ScandinavianFoldingFilter.
ScandinavianNormalizationFilter: This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
ScandinavianNormalizationFilterFactory: Factory for ScandinavianNormalizationFilter.
SetKeywordMarkerFilter: Marks terms as keywords via the KeywordAttribute.
SingleTokenTokenStream: A TokenStream containing a single token.
StemmerOverrideFilter: Provides the ability to override any KeywordAttribute-aware stemmer with custom dictionary-based stemming.
StemmerOverrideFilter.Builder: This builder builds an FST for the StemmerOverrideFilter.
StemmerOverrideFilter.StemmerOverrideMap: A read-only 4-byte FST-backed map that allows fast case-insensitive key-value lookups for StemmerOverrideFilter.
StemmerOverrideFilterFactory: Factory for StemmerOverrideFilter.
TrimFilter: Trims leading and trailing whitespace from Tokens in the stream.
TrimFilterFactory: Factory for TrimFilter.
WordDelimiterFilter: Splits words into subwords and performs optional transformations on subword groups.
WordDelimiterFilterFactory: Factory for WordDelimiterFilter.
WordDelimiterIterator: A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterFilter rules.
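The summary gives LengthFilter and CodepointCountFilter the same one-line description. As far as the two differ, it is in how token length is measured: LengthFilter counts UTF-16 code units, while CodepointCountFilter counts Unicode code points. The following is a minimal sketch of that distinction using only the Java standard library. It does not use or reproduce the Lucene classes; the class name LengthVsCodePointSketch and both helper methods are hypothetical, and inclusive min/max bounds are assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class LengthVsCodePointSketch {

    // Keeps tokens whose length in UTF-16 code units lies in [min, max],
    // in the spirit of LengthFilter.
    static List<String> keepByUtf16Length(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Keeps tokens whose length in Unicode code points lies in [min, max],
    // in the spirit of CodepointCountFilter.
    static List<String> keepByCodePointCount(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int len = token.codePointCount(0, token.length());
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef) is one code point
        // but occupies two UTF-16 code units (a surrogate pair).
        String clef = new String(Character.toChars(0x1D11E));
        List<String> tokens = Arrays.asList("a", "word", clef + clef);

        // With bounds [2, 3], the doubled clef has 4 code units but 2 code points,
        // so only the code point variant keeps it.
        System.out.println(keepByUtf16Length(tokens, 2, 3).size());    // prints 0
        System.out.println(keepByCodePointCount(tokens, 2, 3).size()); // prints 1
    }
}
```

For text in the Basic Multilingual Plane the two measures agree; they diverge only for supplementary characters such as emoji or some historic scripts.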