Index

A B C D E G H I L M N O P R S T U V W 
All Classes and Interfaces|All Packages|Constant Field Values|Serialized Form

A

addChar() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Add a character to the buffer that we're building for a token.
advance() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
advance() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
advance() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
advance() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
advance() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
advance() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
advance() - Method in interface org.tribuo.util.tokens.Tokenizer
Advances the tokenizer to the next token.
advance() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
apply(int, int, CharSequence) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
 
apply(int, int, CharSequence) - Method in interface org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitFunction
Applies the split function.

B

BREAK_ITERATOR - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Creates a BreakIteratorTokenizer.
breakIteratorOptions - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
Options for the break iterator tokenizer.
BreakIteratorTokenizer - Class in org.tribuo.util.tokens.impl
A tokenizer wrapping a BreakIterator instance.
BreakIteratorTokenizer(Locale) - Constructor for class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
Constructs a BreakIteratorTokenizer using the specified locale.
BreakIteratorTokenizerOptions - Class in org.tribuo.util.tokens.options
CLI options for a BreakIteratorTokenizer.
BreakIteratorTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.BreakIteratorTokenizerOptions
 
buff - Variable in class org.tribuo.util.tokens.universal.Range
The character buffer.

C

charAt(int) - Method in class org.tribuo.util.tokens.universal.Range
 
clone() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.WhitespaceTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
 
clone() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
clone() - Method in interface org.tribuo.util.tokens.Tokenizer
Clones a tokenizer with its configuration.
clone() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
CoreTokenizerOptions - Class in org.tribuo.util.tokens.options
CLI Options for all the tokenizers in the core package.
CoreTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.CoreTokenizerOptions
 
CoreTokenizerOptions.CoreTokenizerType - Enum in org.tribuo.util.tokens.options
Tokenizer type.
coreTokenizerType - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
Type of tokenizer.
createSplitFunction(boolean) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Creates a SplitFunctionTokenizer.SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
createSupplier(Tokenizer) - Static method in interface org.tribuo.util.tokens.Tokenizer
Creates a supplier from the specified tokenizer by cloning it.
createThreadLocal(Tokenizer) - Static method in interface org.tribuo.util.tokens.Tokenizer
Creates a thread local source of tokenizers by making a Tokenizer supplier using Tokenizer.createSupplier(Tokenizer).
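The two static factory methods above exist because tokenizers are stateful and not thread-safe; a minimal sketch of using them, assuming Tribuo is on the classpath (the class name SupplierExample is illustrative only):

```java
import java.util.function.Supplier;
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.WhitespaceTokenizer;

public class SupplierExample {
    public static void main(String[] args) throws InterruptedException {
        // Tokenizers carry mutable state, so share a supplier of clones,
        // never a single instance, across threads.
        Supplier<Tokenizer> supplier = Tokenizer.createSupplier(new WhitespaceTokenizer());

        // Or let each thread transparently hold its own clone.
        ThreadLocal<Tokenizer> perThread = Tokenizer.createThreadLocal(new WhitespaceTokenizer());

        Runnable work = () -> System.out.println(perThread.get().split("a b c"));
        Thread worker = new Thread(work);
        worker.start();
        worker.join();
        System.out.println(supplier.get().split("d e"));
    }
}
```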
createWhitespaceTokenizer() - Static method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
Creates a tokenizer that splits on whitespace.

D

DEFAULT_SPLIT_CHARACTERS - Static variable in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
The default split characters.
DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS - Static variable in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
The default characters which don't cause splits inside digits.
DEFAULT_UNKNOWN_TOKEN - Static variable in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
The default unknown token string.

E

end - Variable in class org.tribuo.util.tokens.Token
The end index.
end - Variable in class org.tribuo.util.tokens.universal.Range
The end index.

G

getEnd() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
getEnd() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
getEnd() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
getEnd() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
getEnd() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
getEnd() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
getEnd() - Method in interface org.tribuo.util.tokens.Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.
getEnd() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
getLanguageTag() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
Returns the locale string this tokenizer uses.
getMaxInputCharactersPerWord() - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Gets the maximum character count for a token to consider when Wordpiece.wordpiece(String) is applied to a token.
getMaxTokenLength() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Returns the maximum token length this tokenizer will generate.
getPos() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Gets the current position in the input.
getProvenance() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.WhitespaceTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
getProvenance() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
getSplitCharacters() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
Deprecated.
getSplitPatternRegex() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
Gets the String form of the regex in use.
getSplitXDigitsCharacters() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
Deprecated.
getStart() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
getStart() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
getStart() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
getStart() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
getStart() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
getStart() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
getStart() - Method in interface org.tribuo.util.tokens.Tokenizer
Gets the starting character offset of the current token in the character sequence.
getStart() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
getText() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
getText() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
getText() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
getText() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
getText() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
getText() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
getText() - Method in interface org.tribuo.util.tokens.Tokenizer
Gets the text of the current token as a string.
getText() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
getToken() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
getToken() - Method in interface org.tribuo.util.tokens.Tokenizer
Generates a Token object from the current state of the tokenizer.
getTokenizer() - Method in class org.tribuo.util.tokens.options.BreakIteratorTokenizerOptions
 
getTokenizer() - Method in class org.tribuo.util.tokens.options.CoreTokenizerOptions
 
getTokenizer() - Method in class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
 
getTokenizer() - Method in class org.tribuo.util.tokens.options.SplitPatternTokenizerOptions
 
getTokenizer() - Method in interface org.tribuo.util.tokens.options.TokenizerOptions
Creates the appropriately configured tokenizer.
getType() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
getType() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
getType() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
getType() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
getType() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
getType() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
getType() - Method in interface org.tribuo.util.tokens.Tokenizer
Gets the type of the current token.
getType() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
 
getUnknownToken() - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Gets the "unknown" token specified during initialization.

H

handleChar() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Handle a character to add to the token buffer.

I

incr - Variable in class org.tribuo.util.tokens.universal.Range
The value to increment by.
INFIX - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
Some tokenizers produce "sub-word" tokens.
isChinese(int) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Determines if the provided codepoint is a Chinese character or not.
isControl(int) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Determines if the provided codepoint is a control character or not.
isDigit(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
A quick check for whether a character is a digit.
isGenerateNgrams() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Does this tokenizer generate ngrams?
isGenerateUnigrams() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Does this tokenizer generate unigrams?
isLetterOrDigit(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
isNgram(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
isPunctuation(int) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Determines if the input code point should be considered a character that is punctuation.
isSplitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
Deprecated.
isSplitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
Checks if this is a valid split character or whitespace.
isSplitXDigitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
Deprecated.
isSplitXDigitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
Checks if this is a valid split character outside of a run of digits.
isWhitespace(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
A quick check for whether a character is whitespace.

L

languageTag - Variable in class org.tribuo.util.tokens.options.BreakIteratorTokenizerOptions
BreakIteratorTokenizer - The language tag of the locale to be used.
len - Variable in class org.tribuo.util.tokens.universal.Range
The token length.
length() - Method in class org.tribuo.util.tokens.Token
The number of characters in this token.
length() - Method in class org.tribuo.util.tokens.universal.Range
 

M

makeTokens() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Make one or more tokens from our current collected characters.
maxTokenLength - Variable in class org.tribuo.util.tokens.universal.UniversalTokenizer
The length of the longest token that we will generate.

N

NGRAM - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
An NGRAM corresponds to a token that might correspond to a character ngram - i.e.
NO_SPLIT - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
the current character is added to the in-progress token (i.e.
NO_SPLIT_INFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is infix.
NO_SPLIT_NGRAM - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is an ngram.
NO_SPLIT_PREFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is a prefix.
NO_SPLIT_PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is punctuation.
NO_SPLIT_SUFFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is a suffix.
NO_SPLIT_UNKNOWN - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is unknown.
NO_SPLIT_WHITESPACE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is whitespace.
NO_SPLIT_WORD - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Not a split, is a word.
NON - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Creates a NonTokenizer.
NonTokenizer - Class in org.tribuo.util.tokens.impl
A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.
NonTokenizer() - Constructor for class org.tribuo.util.tokens.impl.NonTokenizer
Constructs a NonTokenizer.

O

org.tribuo.util.tokens - package org.tribuo.util.tokens
Core definitions for tokenization.
org.tribuo.util.tokens.impl - package org.tribuo.util.tokens.impl
Simple fixed rule tokenizers.
org.tribuo.util.tokens.impl.wordpiece - package org.tribuo.util.tokens.impl.wordpiece
Provides an implementation of a Wordpiece tokenizer which conforms to the Tribuo Tokenizer API.
org.tribuo.util.tokens.options - package org.tribuo.util.tokens.options
OLCUT Options implementations which can construct Tokenizers of various types.
org.tribuo.util.tokens.universal - package org.tribuo.util.tokens.universal
An implementation of a "universal" tokenizer which will split on word boundaries or character boundaries for languages where word boundaries are contextual.

P

postConfig() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
Used by the OLCUT configuration system, and should not be called by external code.
postConfig() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
 
postConfig() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
Used by the OLCUT configuration system, and should not be called by external code.
postConfig() - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Used by the OLCUT configuration system, and should not be called by external code.
postConfig() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Used by the OLCUT configuration system, and should not be called by external code.
PREFIX - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
Some tokenizers produce "sub-word" tokens.
punct(char, int) - Method in class org.tribuo.util.tokens.universal.Range
Sets this range to represent a punctuation character.
PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
A PUNCTUATION corresponds to tokens consisting of punctuation characters.

R

Range - Class in org.tribuo.util.tokens.universal
A range currently being segmented.
reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
 
reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.NonTokenizer
 
reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
 
reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
 
reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
 
reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
 
reset(CharSequence) - Method in interface org.tribuo.util.tokens.Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
reset(CharSequence) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Reset state of tokenizer to clean slate.

S

set(char[], int, int) - Method in class org.tribuo.util.tokens.universal.Range
Sets the character range.
set(char, char, int) - Method in class org.tribuo.util.tokens.universal.Range
Sets the first two characters in the range, and the type to NGRAM.
set(char, int) - Method in class org.tribuo.util.tokens.universal.Range
Sets the first character in the range.
setGenerateNgrams(boolean) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Controls if the tokenizer generates ngrams.
setGenerateUnigrams(boolean) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Controls if the tokenizer generates unigrams.
setMaxTokenLength(int) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
Sets the maximum token length this tokenizer will generate.
setType(Token.TokenType) - Method in class org.tribuo.util.tokens.universal.Range
Sets the token type.
SHAPE - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Creates a ShapeTokenizer.
ShapeTokenizer - Class in org.tribuo.util.tokens.impl
This tokenizer is loosely based on the notion of word shape which is a common feature used in NLP.
ShapeTokenizer() - Constructor for class org.tribuo.util.tokens.impl.ShapeTokenizer
Constructs a ShapeTokenizer.
SIMPLE_DEFAULT_PATTERN - Static variable in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
The default split pattern, which is [\.,]?\s+.
split(CharSequence) - Method in interface org.tribuo.util.tokens.Tokenizer
Uses this tokenizer to split a string into its component substrings.
SPLIT_AFTER - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
The current character will cause the in-progress token to be completed after the current character is appended to the in-progress token.
SPLIT_AFTER_INFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after infix.
SPLIT_AFTER_NGRAM - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after an ngram.
SPLIT_AFTER_PREFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after a prefix.
SPLIT_AFTER_PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after punctuation.
SPLIT_AFTER_SUFFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after a suffix.
SPLIT_AFTER_UNKNOWN - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after an unknown value.
SPLIT_AFTER_WHITESPACE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after whitespace.
SPLIT_AFTER_WORD - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split after a word.
SPLIT_AT - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split at.
SPLIT_AT - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
The current character will cause the in-progress token to be completed.
SPLIT_BEFORE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before.
SPLIT_BEFORE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
The current character will cause the in-progress token to be completed; the current character will be included in the next token.
SPLIT_BEFORE_AND_AFTER - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
The current character should cause the in-progress token to be completed.
SPLIT_BEFORE_AND_AFTER_INFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after infix.
SPLIT_BEFORE_AND_AFTER_NGRAM - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after an ngram.
SPLIT_BEFORE_AND_AFTER_PREFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after prefix.
SPLIT_BEFORE_AND_AFTER_PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after punctuation.
SPLIT_BEFORE_AND_AFTER_SUFFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after suffix.
SPLIT_BEFORE_AND_AFTER_UNKNOWN - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after unknown.
SPLIT_BEFORE_AND_AFTER_WHITESPACE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after whitespace.
SPLIT_BEFORE_AND_AFTER_WORD - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Split before and after a word.
SPLIT_CHARACTERS - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Creates a SplitCharactersTokenizer.
SPLIT_PATTERN - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Creates a SplitPatternTokenizer.
SplitCharactersSplitterFunction(char[], char[]) - Constructor for class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
Constructs a splitting function using the supplied split characters.
SplitCharactersTokenizer - Class in org.tribuo.util.tokens.impl
This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.
SplitCharactersTokenizer() - Constructor for class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
SplitCharactersTokenizer(char[], char[]) - Constructor for class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
 
SplitCharactersTokenizer.SplitCharactersSplitterFunction - Class in org.tribuo.util.tokens.impl
Splits tokens at the supplied characters.
splitCharactersTokenizerOptions - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
Options for the split characters tokenizer.
SplitCharactersTokenizerOptions - Class in org.tribuo.util.tokens.options
CLI options for a SplitCharactersTokenizer.
SplitCharactersTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
 
splitChars - Variable in class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
The characters to split on.
splitFunction - Variable in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
The splitting function.
SplitFunctionTokenizer - Class in org.tribuo.util.tokens.impl
This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.
SplitFunctionTokenizer() - Constructor for class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
Constructs a tokenizer, used by OLCUT.
SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction) - Constructor for class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
Creates a new tokenizer using the supplied split function.
SplitFunctionTokenizer.SplitFunction - Interface in org.tribuo.util.tokens.impl
An interface for checking if the text should be split at the supplied codepoint.
SplitFunctionTokenizer.SplitResult - Enum in org.tribuo.util.tokens.impl
SplitFunctionTokenizer.SplitType - Enum in org.tribuo.util.tokens.impl
Defines different ways that a tokenizer can split the input text at a given character.
SplitPatternTokenizer - Class in org.tribuo.util.tokens.impl
This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.
SplitPatternTokenizer() - Constructor for class org.tribuo.util.tokens.impl.SplitPatternTokenizer
Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+.
SplitPatternTokenizer(String) - Constructor for class org.tribuo.util.tokens.impl.SplitPatternTokenizer
Constructs a splitting tokenizer using the supplied regex.
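The two constructors above differ only in the delimiter regex; a minimal sketch, assuming Tribuo is on the classpath (the class name and the custom regex are illustrative only):

```java
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.SplitPatternTokenizer;

public class SplitPatternExample {
    public static void main(String[] args) {
        // The default pattern [\.,]?\s+ splits on whitespace runs,
        // consuming an optional leading '.' or ',' with the delimiter.
        Tokenizer defaults = new SplitPatternTokenizer();
        System.out.println(defaults.split("one, two. three"));

        // A custom delimiter regex, here splitting on hyphens and whitespace.
        Tokenizer custom = new SplitPatternTokenizer("[-\\s]+");
        System.out.println(custom.split("state-of-the-art model"));
    }
}
```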
splitPatternTokenizerOptions - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
Options for the split pattern tokenizer.
SplitPatternTokenizerOptions - Class in org.tribuo.util.tokens.options
CLI options for a SplitPatternTokenizer.
SplitPatternTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.SplitPatternTokenizerOptions
 
splitType - Variable in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
The split type.
splitXDigitsChars - Variable in class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
Characters to split on unless they appear between digits.
start - Variable in class org.tribuo.util.tokens.Token
The start index.
start - Variable in class org.tribuo.util.tokens.universal.Range
The start index.
subSequence(int, int) - Method in class org.tribuo.util.tokens.universal.Range
 
SUFFIX - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
Some tokenizers produce "sub-word" tokens.

T

text - Variable in class org.tribuo.util.tokens.Token
The token text.
Token - Class in org.tribuo.util.tokens
A single token extracted from a String.
Token(String, int, int) - Constructor for class org.tribuo.util.tokens.Token
Constructs a token.
Token(String, int, int, Token.TokenType) - Constructor for class org.tribuo.util.tokens.Token
Constructs a token.
Token.TokenType - Enum in org.tribuo.util.tokens
Tokenizers may produce multiple kinds of tokens, depending on the application to which they're being put.
TokenizationException - Exception in org.tribuo.util.tokens
Wraps exceptions thrown by tokenizers.
TokenizationException(String) - Constructor for exception org.tribuo.util.tokens.TokenizationException
Creates a TokenizationException with the specified message.
TokenizationException(String, Throwable) - Constructor for exception org.tribuo.util.tokens.TokenizationException
Creates a TokenizationException wrapping the supplied throwable with the specified message.
TokenizationException(Throwable) - Constructor for exception org.tribuo.util.tokens.TokenizationException
Creates a TokenizationException wrapping the supplied throwable.
tokenize(CharSequence) - Method in interface org.tribuo.util.tokens.Tokenizer
Uses this tokenizer to tokenize a string and return the list of tokens that were generated.
Tokenizer - Interface in org.tribuo.util.tokens
An interface for things that tokenize text: breaking it into words according to some set of rules.
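The Tokenizer methods indexed above (reset, advance, getText, getStart, getEnd) form a streaming iteration protocol, with tokenize(CharSequence) and split(CharSequence) as convenience wrappers; a minimal sketch of the low-level loop, assuming Tribuo is on the classpath (the class name is illustrative only):

```java
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.WhitespaceTokenizer;

public class TokenizerLoop {
    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        // reset() points the tokenizer at a new character sequence.
        tokenizer.reset("Hello Tribuo tokenizers");
        // advance() moves to the next token, returning false when exhausted;
        // this streams tokens without materialising a List<Token>.
        while (tokenizer.advance()) {
            System.out.printf("%s [%d,%d)%n",
                tokenizer.getText(), tokenizer.getStart(), tokenizer.getEnd());
        }
    }
}
```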
TokenizerOptions - Interface in org.tribuo.util.tokens.options
CLI Options for creating a tokenizer.
tokenType - Variable in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
The token type.
toString() - Method in class org.tribuo.util.tokens.Token
 
toString() - Method in class org.tribuo.util.tokens.universal.Range
 
type - Variable in class org.tribuo.util.tokens.Token
The token type.
type - Variable in class org.tribuo.util.tokens.universal.Range
The current token type.

U

UNIVERSAL - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Creates a UniversalTokenizer.
UniversalTokenizer - Class in org.tribuo.util.tokens.universal
This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).
UniversalTokenizer() - Constructor for class org.tribuo.util.tokens.universal.UniversalTokenizer
Constructs a universal tokenizer which doesn't send punctuation.
UniversalTokenizer(boolean) - Constructor for class org.tribuo.util.tokens.universal.UniversalTokenizer
Constructs a universal tokenizer.
UNKNOWN - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
Some tokenizers may work in concert with vocabulary data.

V

valueOf(String) - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum org.tribuo.util.tokens.Token.TokenType
Returns the enum constant of this type with the specified name.
values() - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum org.tribuo.util.tokens.Token.TokenType
Returns an array containing the constants of this enum type, in the order they are declared.

W

WHITESPACE - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
Some tokenizers may produce tokens corresponding to whitespace (e.g.
whitespaceSplitCharacterFunction - Static variable in class org.tribuo.util.tokens.impl.WhitespaceTokenizer
The splitting function for whitespace, using Character.isWhitespace(char).
WhitespaceTokenizer - Class in org.tribuo.util.tokens.impl
A simple tokenizer that splits on whitespace.
WhitespaceTokenizer() - Constructor for class org.tribuo.util.tokens.impl.WhitespaceTokenizer
Constructs a tokenizer that splits on whitespace.
WORD - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary.
wordpiece(String) - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Executes Wordpiece tokenization on the provided token.
Wordpiece - Class in org.tribuo.util.tokens.impl.wordpiece
This is a vanilla implementation of the Wordpiece algorithm as found here: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
Wordpiece(String) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Constructs a wordpiece by reading the vocabulary from the supplied path.
Wordpiece(String, String, int) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
Wordpiece(Set<String>) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Constructs a Wordpiece using the supplied vocab.
Wordpiece(Set<String>, String) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Constructs a Wordpiece using the supplied vocabulary and unknown token.
Wordpiece(Set<String>, String, int) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
WordpieceBasicTokenizer - Class in org.tribuo.util.tokens.impl.wordpiece
This is a tokenizer that is used "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in huggingface.
WordpieceBasicTokenizer() - Constructor for class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Constructs a default tokenizer which tokenizes Chinese characters.
WordpieceBasicTokenizer(boolean) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
Constructs a tokenizer.
WordpieceTokenizer - Class in org.tribuo.util.tokens.impl.wordpiece
This Tokenizer is meant to be a reasonable approximation of the BertTokenizer defined here.
WordpieceTokenizer(Wordpiece, Tokenizer, boolean, boolean, Set<String>) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
Constructs a wordpiece tokenizer.
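The five-argument constructor above wires a Wordpiece instance to an upstream basic tokenizer; a minimal sketch, assuming Tribuo is on the classpath. The toy vocabulary, and the reading of the two booleans as lowercasing and accent stripping with the final set holding never-split tokens, are assumptions for illustration; consult the WordpieceTokenizer Javadoc for the exact parameter meanings:

```java
import java.util.Collections;
import java.util.Set;
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.wordpiece.Wordpiece;
import org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer;
import org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer;

public class WordpieceExample {
    public static void main(String[] args) {
        // A toy vocabulary; real usage would load a BERT-style vocab file
        // via the Wordpiece(String path) constructor.
        Set<String> vocab = Set.of("[UNK]", "the", "play", "##ing");
        Wordpiece wordpiece = new Wordpiece(vocab);

        // WordpieceBasicTokenizer pre-splits the text (punctuation, CJK
        // characters, etc.) before wordpiece segmentation is applied.
        Tokenizer tokenizer = new WordpieceTokenizer(
            wordpiece, new WordpieceBasicTokenizer(),
            true, true, Collections.emptySet());
        System.out.println(tokenizer.split("the playing"));
    }
}
```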