public abstract class BaseTextTokenizer extends java.lang.Object implements TextTokenizer
An abstract text tokenizer which tokenizes a given string. It discards certain words, known as stop words, depending on the language chosen.
| Constructor and Description |
|---|
| BaseTextTokenizer() |
| Modifier and Type | Method and Description |
|---|---|
| protected java.lang.String | convertWord(java.lang.String word) Converts a word into all lower case and checks if it is a known stop word. |
| java.util.Set<java.lang.String> | tokenize(java.lang.String text) Tokenizes a text and discards all stop-words from it. |
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

stopWords

public java.util.Set<java.lang.String> tokenize(java.lang.String text)
throws java.io.IOException
Description copied from interface: TextTokenizer
Tokenizes a text and discards all stop-words from it.
Specified by: tokenize in interface TextTokenizer
Parameters: text - the text to tokenize
Throws: java.io.IOException - if a low-level I/O error occurs.

protected java.lang.String convertWord(java.lang.String word)
Converts a word to lower case and checks whether it is a known stop word. If it is, the word is discarded and is not considered a valid token.
Parameters: word - the word to convert
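
The interplay between tokenize and convertWord described above can be sketched roughly as follows. This is a self-contained illustration, not the library's implementation: the class name StopWordSketch, the hard-coded stop-word list, the splitting rule (runs of non-letter characters), and the convention of returning null for a discarded word are all assumptions made for the example. The page does not state how the real class splits text, how its stopWords member is populated, or what convertWord returns for a stop word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in that mimics the documented behaviour;
// it does NOT extend or reproduce the real BaseTextTokenizer.
public class StopWordSketch {

    // Assumption: a tiny hard-coded English stop-word list.
    private final Set<String> stopWords =
            new HashSet<>(Arrays.asList("a", "the", "is", "and", "of"));

    // Lower-cases the word and returns null when it is a stop word.
    // The null convention is an assumption of this sketch.
    protected String convertWord(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return stopWords.contains(lower) ? null : lower;
    }

    // Splits on runs of non-letter characters (an assumption) and keeps
    // every word that survives convertWord. Duplicates collapse because
    // the result is a Set.
    public Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split yields an empty first element on leading separators
            }
            String token = convertWord(raw);
            if (token != null) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        StopWordSketch tokenizer = new StopWordSketch();
        // prints [cat, hat]
        System.out.println(tokenizer.tokenize("The cat and the Hat"));
    }
}
```

Because the result is a Set, repeated words appear only once, and a text consisting solely of stop words yields an empty set.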