public abstract class AbstractWordSplitter extends Object
Splits a compound word into its parts. This is especially useful for German words, but it works with any language. The word parts are returned in the same order in which they appear in the compound word. Providing a large dictionary is recommended.
Please note: the input is expected to be a single word consisting only of letters, without any special characters (!":;,.-_ etc.).
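To make the behaviour described above more concrete, here is a minimal sketch of dictionary-based compound splitting. It is not the library's implementation: the class `CompoundSplitSketch`, its `split` method, and the hard-coded interfix "s" are assumptions made for illustration only. It shows that the parts come back in the order in which they appear in the compound, and that the result depends entirely on the dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of jwordsplitter.
class CompoundSplitSketch {

    // Splits `word` into dictionary words, in order of appearance.
    // Returns the lowercased word as a single part if no complete split exists.
    static List<String> split(String word, Set<String> dict) {
        String lower = word.toLowerCase();
        List<String> parts = new ArrayList<>();
        if (splitFrom(lower, 0, dict, parts)) {
            return parts;
        }
        return new ArrayList<>(Arrays.asList(lower));
    }

    private static boolean splitFrom(String word, int start, Set<String> dict, List<String> parts) {
        if (start == word.length()) {
            return true; // consumed the whole word
        }
        // Try the longest candidate first so that whole words win over fragments.
        for (int end = word.length(); end > start; end--) {
            String candidate = word.substring(start, end);
            if (!dict.contains(candidate)) {
                continue;
            }
            parts.add(candidate);
            if (splitFrom(word, end, dict, parts)) {
                return true;
            }
            // Assumption: a single interfix "s" may connect two parts,
            // but it may not be the last character of the word.
            if (end < word.length() - 1 && word.charAt(end) == 's'
                    && splitFrom(word, end + 1, dict, parts)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack
        }
        return false;
    }
}
```

With a dictionary containing "erhebung" and "fehler", `split("Erhebungsfehler", dict)` yields the two parts in that order and drops the connecting "s"; a word with no dictionary match is returned unsplit.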
| Constructor and Description |
|---|
| `AbstractWordSplitter(boolean hideInterfixCharacters)` Create a word splitter that uses the embedded dictionary. |
| `AbstractWordSplitter(boolean hideInterfixCharacters, File plainTextDict)` |
| `AbstractWordSplitter(boolean hideInterfixCharacters, InputStream plainTextDict)` |
| `AbstractWordSplitter(boolean hideInterfixCharacters, Set<String> words)` |
| Modifier and Type | Method and Description |
|---|---|
| `void` | `addException(String completeWord, List<String> wordParts)` |
| `List<List<String>>` | `getAllSplits(String word)` Experimental: Split a word with unknown parts, typically because one part has a typo. |
| `protected abstract int` | `getDefaultMinimumWordLength()` |
| `protected abstract de.danielnaber.jwordsplitter.GermanInterfixDisambiguator` | `getDisambiguator()` |
| `protected abstract Collection<String>` | `getInterfixCharacters()` Interfix elements in lowercase, e.g. at least "s" for German. |
| `List<String>` | `getSubWords(String word)` |
| `protected abstract Set<String>` | `getWordList()` |
| `protected abstract Set<String>` | `getWordList(InputStream stream)` |
| `void` | `setExceptionFile(String filename)` |
| `void` | `setMaximumWordLength(int len)` Words longer than this will throw an IllegalArgumentException to avoid extremely long processing times. |
| `void` | `setMinimumWordLength(int len)` |
| `void` | `setStrictMode(boolean strictMode)` When set to true, words will only be split if all parts are words. |
| `List<String>` | `splitWord(String word)` |
| `List<String>` | `splitWord(String word, boolean collectSubwords)` |
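The length limit set by setMaximumWordLength(int) can be pictured as a simple guard that runs before any splitting is attempted. The sketch below is an assumption about how such a check might look, not the library's code; the class `LengthGuardSketch` and its `checkLength` method are hypothetical, and the default of 70 is the value the documentation states for the maximum word length.

```java
// Hypothetical illustration, not part of jwordsplitter.
class LengthGuardSketch {

    // Documented default for the maximum word length.
    private int maximumWordLength = 70;

    void setMaximumWordLength(int len) {
        this.maximumWordLength = len;
    }

    // Rejects overly long input up front, since trying every possible
    // split of a very long word could take a long time.
    void checkLength(String word) {
        if (word.length() > maximumWordLength) {
            throw new IllegalArgumentException(
                "Word has " + word.length() + " characters, maximum is " + maximumWordLength);
        }
    }
}
```

Callers that process untrusted text would typically catch the IllegalArgumentException and treat the word as unsplittable.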
public AbstractWordSplitter(boolean hideInterfixCharacters) throws IOException
Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
Throws:
IOException

public AbstractWordSplitter(boolean hideInterfixCharacters, InputStream plainTextDict) throws IOException
Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
plainTextDict - a stream of a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
Throws:
IOException

public AbstractWordSplitter(boolean hideInterfixCharacters, File plainTextDict) throws IOException
Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
plainTextDict - a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
Throws:
IOException

public AbstractWordSplitter(boolean hideInterfixCharacters, Set<String> words) throws IOException
Parameters:
hideInterfixCharacters - whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
words - the compound part words
Throws:
IOException

protected abstract Set<String> getWordList(InputStream stream) throws IOException
Throws:
IOException

protected abstract Set<String> getWordList() throws IOException
Throws:
IOException

protected abstract de.danielnaber.jwordsplitter.GermanInterfixDisambiguator getDisambiguator()
protected abstract int getDefaultMinimumWordLength()
protected abstract Collection<String> getInterfixCharacters()
public void setMinimumWordLength(int len)
public void setMaximumWordLength(int len)
Words longer than this will cause an IllegalArgumentException to be thrown, to avoid extremely long processing times. The default is 70.

public void setExceptionFile(String filename) throws IOException
Parameters:
filename - UTF-8 encoded file with exceptions in the classpath, one exception per line, using pipe as delimiter. Example: Pilot|sendung
Throws:
IOException

public void addException(String completeWord, List<String> wordParts)
Parameters:
completeWord - the word for which an exception is to be defined (will be considered case-insensitive)
wordParts - the parts into which the word is to be split (use a list with a single element if the word should not be split)

public void setStrictMode(boolean strictMode)
When set to true, words will only be split if all parts are words.
public List<List<String>> getAllSplits(String word)
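The exception format documented for setExceptionFile(String) and addException(String, List<String>) can be sketched as a small lookup table. This is an illustration under stated assumptions, not the library's implementation: the class `ExceptionTableSketch` and its `addLine` and `lookup` methods are hypothetical, and it assumes that the complete word for a line such as Pilot|sendung is simply the parts joined without the pipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not part of jwordsplitter.
class ExceptionTableSketch {

    private final Map<String, List<String>> exceptions = new HashMap<>();

    // Parses one exception line, e.g. "Pilot|sendung".
    // Assumption: the complete word is the concatenation of the parts.
    void addLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // skip blank lines
        }
        List<String> parts = Arrays.asList(trimmed.split("\\|"));
        addException(String.join("", parts), parts);
    }

    // The complete word is stored lowercased so lookups are case-insensitive.
    void addException(String completeWord, List<String> wordParts) {
        exceptions.put(completeWord.toLowerCase(Locale.ROOT), new ArrayList<>(wordParts));
    }

    // Returns the predefined parts, or null if no exception is defined.
    List<String> lookup(String word) {
        return exceptions.get(word.toLowerCase(Locale.ROOT));
    }
}
```

A word that must never be split would be registered with a single-element list, for example `addException("Sendung", Arrays.asList("Sendung"))`.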
Copyright © 2021. All rights reserved.