All Classes and Interfaces

Class
Description
A tokenizer wrapping a BreakIterator instance.
CLI options for a BreakIteratorTokenizer.
CLI options for all the tokenizers in the core package.
Tokenizer type.
A convenience class for cases where a tokenizer is required but the text should not actually be split into tokens.
A range currently being segmented.
This tokenizer is loosely based on the notion of word shape, a feature commonly used in NLP.
This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.
Splits tokens at the supplied characters.
CLI options for a SplitCharactersTokenizer.
This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.
An interface for checking whether the text should be split at the supplied codepoint.
Defines different ways that a tokenizer can split the input text at a given character.
This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.
CLI options for a SplitPatternTokenizer.
A single token extracted from a String.
Tokenizers may produce multiple kinds of tokens, depending on the application to which they're being put.
Wraps exceptions thrown by tokenizers.
An interface for things that tokenize text: breaking it into words according to some set of rules.
CLI options for creating a tokenizer.
This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).
A simple tokenizer that splits on whitespace.
This is a vanilla implementation of the Wordpiece algorithm as found here: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
This tokenizer is used "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in Hugging Face transformers.
This Tokenizer is meant to be a reasonable approximation of the BertTokenizer defined here.
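As an illustration of the BreakIterator-based tokenization described in the first entry above, here is a minimal sketch of a word tokenizer built directly on the JDK's java.text.BreakIterator. This is an assumption-laden sketch, not the library's actual implementation: the class name BreakIteratorSketch and the letter-or-digit filter are illustrative choices.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a BreakIterator-backed word tokenizer.
// Not the library's implementation; names and filtering are assumptions.
public class BreakIteratorSketch {

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            String segment = text.substring(start, end);
            // BreakIterator also yields whitespace and punctuation segments;
            // keep only segments that contain a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! 42 times."));
    }
}
```

A real implementation would additionally track each token's character offsets (as the Token description above suggests) rather than returning bare strings.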