public class EnglishTextTokenizer extends java.lang.Object implements TextTokenizer
A TextTokenizer implementation for the English language.
| Constructor and Description |
|---|
| EnglishTextTokenizer() |
| Modifier and Type | Method and Description |
|---|---|
| protected java.lang.String | convertWord(java.lang.String word) Converts a word to upper case and checks whether it is a known stop word in the English language. |
| java.util.Set<java.lang.String> | stopWords() Gets all stop-words for a language. |
| java.util.Set<java.lang.String> | tokenize(java.lang.String text) Tokenizes a text and discards all stop-words from it. |
public java.util.Set<java.lang.String> tokenize(java.lang.String text)
throws java.io.IOException
Description copied from interface: TextTokenizer
Tokenizes a text and discards all stop-words from it.
Specified by: tokenize in interface TextTokenizer
Parameters: text - the text to tokenize
Throws: java.io.IOException - if a low-level I/O error occurs.

public java.util.Set<java.lang.String> stopWords()
Description copied from interface: TextTokenizer
Gets all stop-words for a language.
Specified by: stopWords in interface TextTokenizer

protected java.lang.String convertWord(java.lang.String word)
Converts a word to upper case and checks whether it
is a known stop word in the English language. If it is,
the word is discarded and is not considered a valid token.
Parameters: word - the word
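
The behaviour described for tokenize and convertWord can be illustrated with a minimal, self-contained sketch. It is not the source of EnglishTextTokenizer. The class name StopWordSketch, the short stop-word list, the split on non-letter characters, and the use of null to signal a discarded word are all assumptions made for illustration. The real class may split text differently, use a much larger stop-word list, and declares IOException on tokenize.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration only, NOT the real EnglishTextTokenizer.
class StopWordSketch {

    // Assumed, deliberately tiny stop-word list, stored in upper case
    // so it can be compared against converted words.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("A", "AN", "AND", "THE", "OF", "IS", "IT"));

    // Returns the stop words this sketch knows about.
    Set<String> stopWords() {
        return STOP_WORDS;
    }

    // Upper-cases the word and checks it against the stop-word list.
    // Assumption: a discarded word is signalled by returning null.
    String convertWord(String word) {
        String upper = word.toUpperCase(Locale.ENGLISH);
        return STOP_WORDS.contains(upper) ? null : upper;
    }

    // Splits the text on runs of non-letter characters (an assumption),
    // converts each word, and keeps only the words that are not stop words.
    Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String raw : text.split("[^A-Za-z]+")) {
            if (raw.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            String token = convertWord(raw);
            if (token != null) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        StopWordSketch sketch = new StopWordSketch();
        // Prints [QUICK, BROWN, FOX, LAZY, DOG]
        System.out.println(sketch.tokenize("The quick brown fox and the lazy dog"));
    }
}
```

Because the result is a Set, repeated words in the input appear only once in the output. The sketch uses a LinkedHashSet so the printed order follows the input. The documented return type is only java.util.Set, so callers of the real class should not rely on any ordering.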