Index
All Classes and Interfaces|All Packages|Constant Field Values|Serialized Form
A
- addChar() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Add a character to the buffer that we're building for a token.
- advance() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- advance() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- advance() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- advance() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- advance() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- advance() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- advance() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Advances the tokenizer to the next token.
- advance() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- apply(int, int, CharSequence) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
- apply(int, int, CharSequence) - Method in interface org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitFunction
-
Applies the split function.
B
- BREAK_ITERATOR - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Creates a BreakIteratorTokenizer.
- breakIteratorOptions - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
-
Options for the break iterator tokenizer.
- BreakIteratorTokenizer - Class in org.tribuo.util.tokens.impl
-
A tokenizer wrapping a BreakIterator instance.
- BreakIteratorTokenizer(Locale) - Constructor for class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
-
Constructs a BreakIteratorTokenizer using the specified locale.
- BreakIteratorTokenizerOptions - Class in org.tribuo.util.tokens.options
-
CLI options for a BreakIteratorTokenizer.
- BreakIteratorTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.BreakIteratorTokenizerOptions
- buff - Variable in class org.tribuo.util.tokens.universal.Range
-
The character buffer.
C
- charAt(int) - Method in class org.tribuo.util.tokens.universal.Range
- clone() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.WhitespaceTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
- clone() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- clone() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Clones a tokenizer with its configuration.
- clone() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- CoreTokenizerOptions - Class in org.tribuo.util.tokens.options
-
CLI Options for all the tokenizers in the core package.
- CoreTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.CoreTokenizerOptions
- CoreTokenizerOptions.CoreTokenizerType - Enum in org.tribuo.util.tokens.options
-
Tokenizer type.
- coreTokenizerType - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
-
Type of tokenizer.
- createSplitFunction(boolean) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Creates a SplitFunctionTokenizer.SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
- createSupplier(Tokenizer) - Static method in interface org.tribuo.util.tokens.Tokenizer
-
Creates a supplier from the specified tokenizer by cloning it.
- createThreadLocal(Tokenizer) - Static method in interface org.tribuo.util.tokens.Tokenizer
-
Creates a thread local source of tokenizers by making a Tokenizer supplier using Tokenizer.createSupplier(Tokenizer).
- createWhitespaceTokenizer() - Static method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
Creates a tokenizer that splits on whitespace.
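The supplier and thread-local helpers above hand out clones of a prototype tokenizer, since tokenizers are stateful and must not be shared across threads. A minimal self-contained sketch of that clone-based pattern, using a toy `Tok` class as a hypothetical stand-in for the real `org.tribuo.util.tokens.Tokenizer`:

```java
import java.util.function.Supplier;

public class CloneSupplierSketch {
    // Toy stand-in for a stateful tokenizer; real code would clone a Tokenizer.
    static class Tok implements Cloneable {
        int pos = 0;
        @Override
        public Tok clone() {
            try {
                return (Tok) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    // Mirrors the shape of Tokenizer.createSupplier: each get() hands out a fresh clone.
    static Supplier<Tok> createSupplier(Tok prototype) {
        return prototype::clone;
    }

    // Mirrors the shape of Tokenizer.createThreadLocal: each thread gets its own clone.
    static ThreadLocal<Tok> createThreadLocal(Tok prototype) {
        return ThreadLocal.withInitial(createSupplier(prototype));
    }

    public static void main(String[] args) {
        Tok proto = new Tok();
        Supplier<Tok> supplier = createSupplier(proto);
        Tok a = supplier.get();
        Tok b = supplier.get();
        a.pos = 42;
        // Clones are independent: mutating one does not affect the other.
        if (a == b || b.pos != 0 || proto.pos != 0) {
            throw new AssertionError("clones should be independent");
        }
        System.out.println("independent clones: ok");
    }
}
```

Because every caller mutates only its own clone, tokenizers obtained this way can be used concurrently without synchronization.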
D
- DEFAULT_SPLIT_CHARACTERS - Static variable in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
The default split characters.
- DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS - Static variable in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
The default characters which don't cause splits inside digits.
- DEFAULT_UNKNOWN_TOKEN - Static variable in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
The default unknown token string.
E
- end - Variable in class org.tribuo.util.tokens.Token
-
The end index.
- end - Variable in class org.tribuo.util.tokens.universal.Range
-
The end index.
G
- getEnd() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- getEnd() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- getEnd() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- getEnd() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- getEnd() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- getEnd() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- getEnd() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Gets the ending offset (exclusive) of the current token in the character sequence.
- getEnd() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- getLanguageTag() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
-
Returns the locale string this tokenizer uses.
- getMaxInputCharactersPerWord() - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Gets the maximum character count for a token to consider when Wordpiece.wordpiece(String) is applied to a token.
- getMaxTokenLength() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Returns the maximum token length this tokenizer will generate.
- getPos() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Gets the current position in the input.
- getProvenance() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.WhitespaceTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- getProvenance() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- getSplitCharacters() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
Deprecated.
- getSplitPatternRegex() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
-
Gets the String form of the regex in use.
- getSplitXDigitsCharacters() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
Deprecated.
- getStart() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- getStart() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- getStart() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- getStart() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- getStart() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- getStart() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- getStart() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Gets the starting character offset of the current token in the character sequence.
- getStart() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- getText() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- getText() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- getText() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- getText() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- getText() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- getText() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- getText() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Gets the text of the current token, as a string.
- getText() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- getToken() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- getToken() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Generates a Token object from the current state of the tokenizer.
- getTokenizer() - Method in class org.tribuo.util.tokens.options.BreakIteratorTokenizerOptions
- getTokenizer() - Method in class org.tribuo.util.tokens.options.CoreTokenizerOptions
- getTokenizer() - Method in class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
- getTokenizer() - Method in class org.tribuo.util.tokens.options.SplitPatternTokenizerOptions
- getTokenizer() - Method in interface org.tribuo.util.tokens.options.TokenizerOptions
-
Creates the appropriately configured tokenizer.
- getType() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- getType() - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- getType() - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- getType() - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- getType() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- getType() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- getType() - Method in interface org.tribuo.util.tokens.Tokenizer
-
Gets the type of the current token.
- getType() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
- getUnknownToken() - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Gets the "unknown" token specified during initialization.
H
- handleChar() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Handle a character to add to the token buffer.
I
- incr - Variable in class org.tribuo.util.tokens.universal.Range
-
The value to increment by.
- INFIX - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
Some tokenizers produce "sub-word" tokens.
- isChinese(int) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Determines if the provided codepoint is a Chinese character or not.
- isControl(int) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Determines if the provided codepoint is a control character or not.
- isDigit(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
A quick check for whether a character is a digit.
- isGenerateNgrams() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Does this tokenizer generate ngrams?
- isGenerateUnigrams() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Does this tokenizer generate unigrams?
- isLetterOrDigit(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
- isNgram(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
- isPunctuation(int) - Static method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Determines if the input code point should be considered a character that is punctuation.
- isSplitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
Deprecated.
- isSplitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
-
Checks if this is a valid split character or whitespace.
- isSplitXDigitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
Deprecated.
- isSplitXDigitCharacter(char) - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
-
Checks if this is a valid split character outside of a run of digits.
- isWhitespace(char) - Static method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
A quick check for whether a character is whitespace.
L
- languageTag - Variable in class org.tribuo.util.tokens.options.BreakIteratorTokenizerOptions
-
BreakIteratorTokenizer - The language tag of the locale to be used.
- len - Variable in class org.tribuo.util.tokens.universal.Range
-
The token length.
- length() - Method in class org.tribuo.util.tokens.Token
-
The number of characters in this token.
- length() - Method in class org.tribuo.util.tokens.universal.Range
M
- makeTokens() - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Make one or more tokens from our current collected characters.
- maxTokenLength - Variable in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
The length of the longest token that we will generate.
N
- NGRAM - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
An NGRAM corresponds to a token that might correspond to a character ngram - i.e.
- NO_SPLIT - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
the current character is added to the in-progress token (i.e.
- NO_SPLIT_INFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is infix.
- NO_SPLIT_NGRAM - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is a ngram.
- NO_SPLIT_PREFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is a prefix.
- NO_SPLIT_PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is punctuation.
- NO_SPLIT_SUFFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is a suffix.
- NO_SPLIT_UNKNOWN - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is unknown.
- NO_SPLIT_WHITESPACE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is whitespace.
- NO_SPLIT_WORD - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Not a split, is a word.
- NON - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Creates a NonTokenizer.
- NonTokenizer - Class in org.tribuo.util.tokens.impl
-
A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.
- NonTokenizer() - Constructor for class org.tribuo.util.tokens.impl.NonTokenizer
-
Constructs a NonTokenizer.
O
- org.tribuo.util.tokens - package org.tribuo.util.tokens
-
Core definitions for tokenization.
- org.tribuo.util.tokens.impl - package org.tribuo.util.tokens.impl
-
Simple fixed rule tokenizers.
- org.tribuo.util.tokens.impl.wordpiece - package org.tribuo.util.tokens.impl.wordpiece
-
Provides an implementation of a Wordpiece tokenizer which conforms to the Tribuo Tokenizer API.
- org.tribuo.util.tokens.options - package org.tribuo.util.tokens.options
-
OLCUT Options implementations which can construct Tokenizers of various types.
- org.tribuo.util.tokens.universal - package org.tribuo.util.tokens.universal
-
An implementation of a "universal" tokenizer which will split on word boundaries or character boundaries for languages where word boundaries are contextual.
P
- postConfig() - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
-
Used by the OLCUT configuration system, and should not be called by external code.
- postConfig() - Method in class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
- postConfig() - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
-
Used by the OLCUT configuration system, and should not be called by external code.
- postConfig() - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Used by the OLCUT configuration system, and should not be called by external code.
- postConfig() - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Used by the OLCUT configuration system, and should not be called by external code.
- PREFIX - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
Some tokenizers produce "sub-word" tokens.
- punct(char, int) - Method in class org.tribuo.util.tokens.universal.Range
-
Sets this range to represent a punctuation character.
- PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
A PUNCTUATION corresponds to tokens consisting of punctuation characters.
R
- Range - Class in org.tribuo.util.tokens.universal
-
A range currently being segmented.
- reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.BreakIteratorTokenizer
- reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.NonTokenizer
- reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.ShapeTokenizer
- reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
- reset(CharSequence) - Method in class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
- reset(CharSequence) - Method in interface org.tribuo.util.tokens.Tokenizer
-
Resets the tokenizer so that it operates on a new sequence of characters.
- reset(CharSequence) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Resets the state of the tokenizer to a clean slate.
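The reset and advance entries above describe the streaming iteration contract shared by all these tokenizers: reset binds the tokenizer to new input, then each advance call moves to the next token (returning false when the input is exhausted), with getText, getStart, and getEnd reporting the current token. A self-contained sketch of that contract, using a toy whitespace tokenizer rather than the real Tribuo classes:

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingContractSketch {
    // Toy tokenizer following the reset()/advance()/getText() contract.
    static class ToyWhitespaceTokenizer {
        private String text;
        private int start;
        private int end;

        void reset(CharSequence cs) {
            text = cs.toString();
            start = 0;
            end = 0;
        }

        boolean advance() {
            // Skip any whitespace before the next token.
            start = end;
            while (start < text.length() && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
            if (start == text.length()) {
                return false; // no more tokens
            }
            // Consume non-whitespace characters to form the token.
            end = start;
            while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
                end++;
            }
            return true;
        }

        String getText() { return text.substring(start, end); }
        int getStart() { return start; }
        int getEnd() { return end; } // exclusive, matching Tokenizer.getEnd()
    }

    // The reset-then-advance loop that split(CharSequence) performs internally.
    static List<String> split(ToyWhitespaceTokenizer t, CharSequence cs) {
        t.reset(cs);
        List<String> out = new ArrayList<>();
        while (t.advance()) {
            out.add(t.getText());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split(new ToyWhitespaceTokenizer(), "hello  tokenizer world"));
        // → [hello, tokenizer, world]
    }
}
```

The same loop underlies tokenize(CharSequence) as well; it simply collects Token objects (text plus offsets) instead of bare strings.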
S
- set(char[], int, int) - Method in class org.tribuo.util.tokens.universal.Range
-
Sets the character range.
- set(char, char, int) - Method in class org.tribuo.util.tokens.universal.Range
-
Sets the first two characters in the range, and the type to NGRAM.
- set(char, int) - Method in class org.tribuo.util.tokens.universal.Range
-
Sets the first character in the range.
- setGenerateNgrams(boolean) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Controls if the tokenizer generates ngrams.
- setGenerateUnigrams(boolean) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Controls if the tokenizer generates unigrams.
- setMaxTokenLength(int) - Method in class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Sets the maximum token length this tokenizer will generate.
- setType(Token.TokenType) - Method in class org.tribuo.util.tokens.universal.Range
-
Sets the token type.
- SHAPE - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Creates a ShapeTokenizer.
- ShapeTokenizer - Class in org.tribuo.util.tokens.impl
-
This tokenizer is loosely based on the notion of word shape which is a common feature used in NLP.
- ShapeTokenizer() - Constructor for class org.tribuo.util.tokens.impl.ShapeTokenizer
-
Constructs a ShapeTokenizer.
- SIMPLE_DEFAULT_PATTERN - Static variable in class org.tribuo.util.tokens.impl.SplitPatternTokenizer
-
The default split pattern, which is [\.,]?\s+.
- split(CharSequence) - Method in interface org.tribuo.util.tokens.Tokenizer
-
Uses this tokenizer to split a string into its component substrings.
- SPLIT_AFTER - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
The current character will cause the in-progress token to be completed after the current character is appended to the in-progress token.
- SPLIT_AFTER_INFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after infix.
- SPLIT_AFTER_NGRAM - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after a ngram.
- SPLIT_AFTER_PREFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after a prefix.
- SPLIT_AFTER_PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after punctuation.
- SPLIT_AFTER_SUFFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after a suffix.
- SPLIT_AFTER_UNKNOWN - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after an unknown value.
- SPLIT_AFTER_WHITESPACE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after whitespace.
- SPLIT_AFTER_WORD - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split after a word.
- SPLIT_AT - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split at.
- SPLIT_AT - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
The current character will cause the in-progress token to be completed.
- SPLIT_BEFORE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before.
- SPLIT_BEFORE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
The current character will cause the in-progress token to be completed; the current character will be included in the next token.
- SPLIT_BEFORE_AND_AFTER - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
The current character should cause the in-progress token to be completed.
- SPLIT_BEFORE_AND_AFTER_INFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after infix.
- SPLIT_BEFORE_AND_AFTER_NGRAM - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after a ngram.
- SPLIT_BEFORE_AND_AFTER_PREFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after prefix.
- SPLIT_BEFORE_AND_AFTER_PUNCTUATION - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after punctuation.
- SPLIT_BEFORE_AND_AFTER_SUFFIX - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after suffix.
- SPLIT_BEFORE_AND_AFTER_UNKNOWN - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after unknown.
- SPLIT_BEFORE_AND_AFTER_WHITESPACE - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after whitespace.
- SPLIT_BEFORE_AND_AFTER_WORD - Enum constant in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Split before and after a word.
- SPLIT_CHARACTERS - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Creates a SplitCharactersTokenizer.
- SPLIT_PATTERN - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Creates a SplitPatternTokenizer.
- SplitCharactersSplitterFunction(char[], char[]) - Constructor for class org.tribuo.util.tokens.impl.SplitCharactersTokenizer.SplitCharactersSplitterFunction
-
Constructs a splitting function using the supplied split characters.
- SplitCharactersTokenizer - Class in org.tribuo.util.tokens.impl
-
This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.
- SplitCharactersTokenizer() - Constructor for class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
-
Creates a default split characters tokenizer using SplitCharactersTokenizer.DEFAULT_SPLIT_CHARACTERS and SplitCharactersTokenizer.DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS.
- SplitCharactersTokenizer(char[], char[]) - Constructor for class org.tribuo.util.tokens.impl.SplitCharactersTokenizer
- SplitCharactersTokenizer.SplitCharactersSplitterFunction - Class in org.tribuo.util.tokens.impl
-
Splits tokens at the supplied characters.
- splitCharactersTokenizerOptions - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
-
Options for the split characters tokenizer.
- SplitCharactersTokenizerOptions - Class in org.tribuo.util.tokens.options
-
CLI options for a SplitCharactersTokenizer.
- SplitCharactersTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
- splitChars - Variable in class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
-
The characters to split on.
- splitFunction - Variable in class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
-
The splitting function.
- SplitFunctionTokenizer - Class in org.tribuo.util.tokens.impl
-
This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.
- SplitFunctionTokenizer() - Constructor for class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
-
Constructs a tokenizer, used by OLCUT.
- SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction) - Constructor for class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
-
Creates a new tokenizer using the supplied split function.
- SplitFunctionTokenizer.SplitFunction - Interface in org.tribuo.util.tokens.impl
-
An interface for checking if the text should be split at the supplied codepoint.
- SplitFunctionTokenizer.SplitResult - Enum in org.tribuo.util.tokens.impl
-
A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType.
- SplitFunctionTokenizer.SplitType - Enum in org.tribuo.util.tokens.impl
-
Defines different ways that a tokenizer can split the input text at a given character.
- SplitPatternTokenizer - Class in org.tribuo.util.tokens.impl
-
This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.
- SplitPatternTokenizer() - Constructor for class org.tribuo.util.tokens.impl.SplitPatternTokenizer
-
Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+.
- SplitPatternTokenizer(String) - Constructor for class org.tribuo.util.tokens.impl.SplitPatternTokenizer
-
Constructs a splitting tokenizer using the supplied regex.
- splitPatternTokenizerOptions - Variable in class org.tribuo.util.tokens.options.CoreTokenizerOptions
-
Options for the split pattern tokenizer.
- SplitPatternTokenizerOptions - Class in org.tribuo.util.tokens.options
-
CLI options for a SplitPatternTokenizer.
- SplitPatternTokenizerOptions() - Constructor for class org.tribuo.util.tokens.options.SplitPatternTokenizerOptions
- splitType - Variable in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
The split type.
- splitXDigitsChars - Variable in class org.tribuo.util.tokens.options.SplitCharactersTokenizerOptions
-
Characters to split on unless they appear between digits.
- start - Variable in class org.tribuo.util.tokens.Token
-
The start index.
- start - Variable in class org.tribuo.util.tokens.universal.Range
-
The start index.
- subSequence(int, int) - Method in class org.tribuo.util.tokens.universal.Range
- SUFFIX - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
Some tokenizers produce "sub-word" tokens.
T
- text - Variable in class org.tribuo.util.tokens.Token
-
The token text.
- Token - Class in org.tribuo.util.tokens
-
A single token extracted from a String.
- Token(String, int, int) - Constructor for class org.tribuo.util.tokens.Token
-
Constructs a token.
- Token(String, int, int, Token.TokenType) - Constructor for class org.tribuo.util.tokens.Token
-
Constructs a token.
- Token.TokenType - Enum in org.tribuo.util.tokens
-
Tokenizers may produce multiple kinds of tokens, depending on the application to which they're being put.
- TokenizationException - Exception in org.tribuo.util.tokens
-
Wraps exceptions thrown by tokenizers.
- TokenizationException(String) - Constructor for exception org.tribuo.util.tokens.TokenizationException
-
Creates a TokenizationException with the specified message.
- TokenizationException(String, Throwable) - Constructor for exception org.tribuo.util.tokens.TokenizationException
-
Creates a TokenizationException wrapping the supplied throwable with the specified message.
- TokenizationException(Throwable) - Constructor for exception org.tribuo.util.tokens.TokenizationException
-
Creates a TokenizationException wrapping the supplied throwable.
- tokenize(CharSequence) - Method in interface org.tribuo.util.tokens.Tokenizer
-
Uses this tokenizer to tokenize a string and return the list of tokens that were generated.
- Tokenizer - Interface in org.tribuo.util.tokens
-
An interface for things that tokenize text: breaking it into words according to some set of rules.
- TokenizerOptions - Interface in org.tribuo.util.tokens.options
-
CLI Options for creating a tokenizer.
- tokenType - Variable in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
The token type.
- toString() - Method in class org.tribuo.util.tokens.Token
- toString() - Method in class org.tribuo.util.tokens.universal.Range
- type - Variable in class org.tribuo.util.tokens.Token
-
The token type.
- type - Variable in class org.tribuo.util.tokens.universal.Range
-
The current token type.
U
- UNIVERSAL - Enum constant in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Creates a UniversalTokenizer.
- UniversalTokenizer - Class in org.tribuo.util.tokens.universal
-
This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).
- UniversalTokenizer() - Constructor for class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Constructs a universal tokenizer which doesn't send punctuation.
- UniversalTokenizer(boolean) - Constructor for class org.tribuo.util.tokens.universal.UniversalTokenizer
-
Constructs a universal tokenizer.
- UNKNOWN - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
Some tokenizers may work in concert with vocabulary data.
V
- valueOf(String) - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum org.tribuo.util.tokens.Token.TokenType
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitResult
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.tribuo.util.tokens.impl.SplitFunctionTokenizer.SplitType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.tribuo.util.tokens.options.CoreTokenizerOptions.CoreTokenizerType
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum org.tribuo.util.tokens.Token.TokenType
-
Returns an array containing the constants of this enum type, in the order they are declared.
W
- WHITESPACE - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
Some tokenizers may produce tokens corresponding to whitespace (e.g.
- whitespaceSplitCharacterFunction - Static variable in class org.tribuo.util.tokens.impl.WhitespaceTokenizer
-
The splitting function for whitespace, using Character.isWhitespace(char).
- WhitespaceTokenizer - Class in org.tribuo.util.tokens.impl
-
A simple tokenizer that splits on whitespace.
- WhitespaceTokenizer() - Constructor for class org.tribuo.util.tokens.impl.WhitespaceTokenizer
-
Constructs a tokenizer that splits on whitespace.
- WORD - Enum constant in enum org.tribuo.util.tokens.Token.TokenType
-
A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary.
- wordpiece(String) - Method in class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Executes Wordpiece tokenization on the provided token.
- Wordpiece - Class in org.tribuo.util.tokens.impl.wordpiece
-
This is a vanilla implementation of the Wordpiece algorithm as found here: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
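The Wordpiece algorithm referenced above is a greedy longest-prefix match against a vocabulary, with non-initial pieces carrying a "##" continuation prefix and a fallback to the unknown token. A self-contained sketch of that standard algorithm (not the actual Tribuo Wordpiece class; the tiny vocabulary and the method shape here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {
    // Greedy longest-match wordpiece, following the standard BERT-style algorithm.
    static List<String> wordpiece(String token, Set<String> vocab, String unknown, int maxChars) {
        if (token.length() > maxChars) {
            return List.of(unknown); // overlong tokens map to the unknown token
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            // Find the longest vocabulary entry matching at 'start'; non-initial
            // pieces carry the "##" continuation prefix.
            int end = token.length();
            String match = null;
            while (end > start) {
                String piece = token.substring(start, end);
                if (start > 0) {
                    piece = "##" + piece;
                }
                if (vocab.contains(piece)) {
                    match = piece;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Any unmatchable span makes the whole token unknown.
                return List.of(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "aff");
        System.out.println(wordpiece("unaffable", vocab, "[UNK]", 100));
        // → [un, ##aff, ##able]
    }
}
```

The greedy choice means the longest matching piece always wins at each position, which is why "unaffable" splits as "un" + "##aff" + "##able" rather than trying shorter prefixes first.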
- Wordpiece(String) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Constructs a wordpiece by reading the vocabulary from the supplied path.
- Wordpiece(String, String, int) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
- Wordpiece(Set<String>) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Constructs a Wordpiece using the supplied vocab.
- Wordpiece(Set<String>, String) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Constructs a Wordpiece using the supplied vocabulary and unknown token.
- Wordpiece(Set<String>, String, int) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.Wordpiece
-
Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
- WordpieceBasicTokenizer - Class in org.tribuo.util.tokens.impl.wordpiece
-
This is a tokenizer that is used "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in huggingface.
- WordpieceBasicTokenizer() - Constructor for class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Constructs a default tokenizer which tokenizes Chinese characters.
- WordpieceBasicTokenizer(boolean) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
-
Constructs a tokenizer.
- WordpieceTokenizer - Class in org.tribuo.util.tokens.impl.wordpiece
-
This Tokenizer is meant to be a reasonable approximation of the BertTokenizer defined here.
- WordpieceTokenizer(Wordpiece, Tokenizer, boolean, boolean, Set<String>) - Constructor for class org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
-
Constructs a wordpiece tokenizer.