All Classes and Interfaces
Class - Description
BreakIteratorTokenizer - A tokenizer wrapping a BreakIterator instance.
BreakIteratorTokenizerOptions - CLI options for a BreakIteratorTokenizer.
CoreTokenizerOptions - CLI options for all the tokenizers in the core package.
CoreTokenizerOptions.CoreTokenizerType - Tokenizer type.
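A tokenizer that wraps a BreakIterator delegates boundary detection to the locale-aware rules in java.text.BreakIterator. The sketch below is a minimal illustration of that idea using only the JDK; the class and method names are invented for the example and are not the library's API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Collects the word tokens found by a java.text.BreakIterator,
    // skipping spans that contain no letters or digits (pure punctuation/space).
    public static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end);
            if (span.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(span);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US)); // prints [Hello, world]
    }
}
```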
NonTokenizer - A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.
Range - A range currently being segmented.
ShapeTokenizer - This tokenizer is loosely based on the notion of word shape, a common feature used in NLP.
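One common form of shape-based splitting is to classify each character into a coarse class (digit, uppercase, lowercase, other) and start a new token whenever the class changes. The sketch below illustrates that general technique with invented names; it does not reproduce this library's exact shape rules.

```java
import java.util.ArrayList;
import java.util.List;

public class ShapeSplitDemo {
    // Maps a character to a coarse "shape" class.
    private static int shapeClass(char c) {
        if (Character.isDigit(c)) return 0;
        if (Character.isUpperCase(c)) return 1;
        if (Character.isLowerCase(c)) return 2;
        return 3; // punctuation, whitespace, everything else
    }

    // Starts a new token whenever the shape class changes,
    // e.g. "RFC2616" -> ["RFC", "2616"]. Whitespace is never emitted.
    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prev = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int cls = shapeClass(c);
            if (prev != -1 && cls != prev && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (!Character.isWhitespace(c)) current.append(c);
            prev = cls;
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("RFC2616 rocks")); // prints [RFC, 2616, rocks]
    }
}
```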
SplitCharactersTokenizer - This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.
SplitCharactersTokenizer.SplitCharactersSplitterFunction - Splits tokens at the supplied characters.
SplitCharactersTokenizerOptions - CLI options for a SplitCharactersTokenizer.
SplitFunctionTokenizer - This class supports character-by-character (that is, codepoint-by-codepoint) iteration over the input text to create tokens.
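Splitting on a supplied set of characters can be sketched in a few lines of plain Java: walk the text, and whenever the current character is one of the split characters, close the token in progress. The names below are invented for the illustration, not taken from the library.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCharsDemo {
    // Splits text wherever one of the supplied characters occurs; the split
    // characters themselves are discarded and empty spans are skipped.
    public static List<String> split(String text, char... splitChars) {
        String set = new String(splitChars);
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (set.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b;c d", ',', ';', ' ')); // prints [a, b, c, d]
    }
}
```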
SplitFunctionTokenizer.SplitFunction - An interface for checking whether the text should be split at the supplied codepoint.
SplitFunctionTokenizer.SplitResult - A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType.
SplitFunctionTokenizer.SplitType - Defines the different ways a tokenizer can split the input text at a given character.
SplitPatternTokenizer - This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.
SplitPatternTokenizerOptions - CLI options for a SplitPatternTokenizer.
Token - A single token extracted from a String.
Token.TokenType - Tokenizers may produce multiple kinds of tokens, depending on the application to which they're being put.
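The pattern-splitting approach maps directly onto the JDK's java.util.regex.Pattern: the compiled pattern matches the delimiters, and everything between matches becomes a token. A minimal sketch (the class name here is invented for the example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SplitPatternDemo {
    // Treats the regular expression as the token delimiter: Pattern.split
    // returns the spans between matches, and empty spans are dropped.
    public static List<String> split(String text, String pattern) {
        return Arrays.stream(Pattern.compile(pattern).split(text))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Split on any run of non-alphanumeric characters.
        System.out.println(split("one-two  three!", "[^A-Za-z0-9]+")); // prints [one, two, three]
    }
}
```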
TokenizationException - Wraps exceptions thrown by tokenizers.
Tokenizer - An interface for things that tokenize text: breaking it into words according to some set of rules.
TokenizerOptions - CLI options for creating a tokenizer.
UniversalTokenizer - This class was originally written for document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).
WhitespaceTokenizer - A simple tokenizer that splits on whitespace.
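Whitespace splitting is the simplest tokenization strategy of all: token boundaries are exactly the runs of whitespace. A self-contained sketch using the JDK (the class name is invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceDemo {
    // Splits on runs of whitespace; leading whitespace produces an empty
    // first element from String.split, which is filtered out.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  a quick\ttest\n")); // prints [a, quick, test]
    }
}
```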
Wordpiece - A vanilla implementation of the Wordpiece algorithm as found here: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
WordpieceBasicTokenizer - A tokenizer used "upstream" of WordpieceTokenizer that implements much of the functionality of the 'BasicTokenizer' implementation in huggingface.
WordpieceTokenizer - This Tokenizer is meant to be a reasonable approximation of the BertTokenizer defined here.
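The core of the Wordpiece algorithm is greedy longest-match-first subword segmentation: repeatedly take the longest vocabulary entry that prefixes the remaining characters, marking non-initial pieces with a "##" continuation prefix, and fall back to an unknown token when no prefix matches. The sketch below shows that core loop under those assumptions; it is not the library's implementation, and the names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceDemo {
    // Greedy longest-match-first Wordpiece segmentation of a single word.
    // Pieces after the first are looked up with a "##" continuation prefix;
    // if no vocabulary entry matches at some position, the whole word maps
    // to the unknown token.
    public static List<String> wordpiece(String word, Set<String> vocab, String unk) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) candidate = "##" + candidate;
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) return List.of(unk);
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(wordpiece("unaffable", vocab, "[UNK]")); // prints [un, ##aff, ##able]
    }
}
```

Real vocabularies also cap the maximum characters considered per word; that refinement is omitted here for brevity.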