Package ai.djl.modality.nlp.bert
Class BertFullTokenizer
- java.lang.Object
  - ai.djl.modality.nlp.preprocess.SimpleTokenizer
    - ai.djl.modality.nlp.bert.BertTokenizer
      - ai.djl.modality.nlp.bert.BertFullTokenizer
All Implemented Interfaces:
TextProcessor, Tokenizer
public class BertFullTokenizer extends BertTokenizer
BertFullTokenizer runs end-to-end tokenization of input text. It first runs basic preprocessors to clean the input text and then runs WordpieceTokenizer to split the result into word pieces.
Reference implementation: Google Research Bert Tokenizer
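The word-piece step described above can be illustrated with a minimal, standalone sketch. It assumes the greedy longest-match-first strategy and the `##` continuation prefix commonly used with BERT vocabularies. The class `WordPieceSketch`, the method `wordPiece`, and the toy vocabulary are hypothetical names for illustration only; this is not DJL's WordpieceTokenizer source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class WordPieceSketch {

    /** Splits one word into pieces using greedy longest-match-first lookup (illustrative only). */
    public static List<String> wordPiece(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // assumed convention: continuation pieces carry "##"
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing matched at this position: the whole word maps to the unknown token.
                return List.of(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "play", "##ing");
        System.out.println(wordPiece("unaffable", vocab, "[UNK]")); // prints [un, ##aff, ##able]
        System.out.println(wordPiece("xyz", vocab, "[UNK]"));       // prints [[UNK]]
    }
}
```

In this sketch a word with no matching piece collapses to a single unknown token rather than being partially split, which is one common design choice; other handling is possible.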
-
-
Constructor Summary
Constructors
- BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase): Creates an instance of BertFullTokenizer.
-
Method Summary
Methods (modifier and type, method, description)
- java.lang.String buildSentence(java.util.List<java.lang.String> tokens): Combines a list of tokens to form a sentence.
- static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase): Get a list of TextProcessors to process input text for Bert models.
- Vocabulary getVocabulary(): Returns the Vocabulary used for tokenization.
- java.util.List<java.lang.String> tokenize(java.lang.String input): Breaks down the given sentence into a list of tokens that can be represented by embeddings.
-
Methods inherited from class ai.djl.modality.nlp.bert.BertTokenizer
encode, encode, pad, tokenToString
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
-
Constructor Detail
-
BertFullTokenizer
public BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
Creates an instance of BertFullTokenizer.
Parameters:
- vocabulary - the BERT vocabulary
- lowerCase - whether to convert tokens to lowercase
-
-
Method Detail
-
getVocabulary
public Vocabulary getVocabulary()
Returns the Vocabulary used for tokenization.
Returns:
- the Vocabulary used for tokenization
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String input)
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Specified by:
- tokenize in interface Tokenizer
Overrides:
- tokenize in class BertTokenizer
Parameters:
- input - the sentence to tokenize
Returns:
- a List of tokens
-
buildSentence
public java.lang.String buildSentence(java.util.List<java.lang.String> tokens)
Combines a list of tokens to form a sentence.
Specified by:
- buildSentence in interface Tokenizer
Overrides:
- buildSentence in class SimpleTokenizer
Parameters:
- tokens - the List of tokens
Returns:
- the sentence built from the given tokens
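As a rough illustration of what combining word-piece tokens back into a sentence can look like, the sketch below joins tokens with spaces and then glues `##` continuation pieces onto the preceding token. The `##` prefix is an assumption taken from the usual WordPiece convention, and `BuildSentenceSketch` is a hypothetical standalone class, not the library's implementation.

```java
import java.util.List;

public class BuildSentenceSketch {

    /** Joins tokens with spaces and merges "##" continuation pieces into the previous token. */
    public static String buildSentence(List<String> tokens) {
        // " ##" marks a boundary inside a word, so dropping it re-attaches the piece.
        return String.join(" ", tokens).replace(" ##", "").trim();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("un", "##aff", "##able", "play", "##ing");
        System.out.println(buildSentence(tokens)); // prints unaffable playing
    }
}
```

Note that this kind of reconstruction is lossy: original casing, spacing around punctuation, and anything mapped to an unknown token cannot be recovered from the tokens alone.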
-
getPreprocessors
public static java.util.List<TextProcessor> getPreprocessors(boolean lowerCase)
Get a list of TextProcessors to process input text for Bert models.
Parameters:
- lowerCase - whether to convert input to lowercase
Returns:
- List of TextProcessors
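The idea of a preprocessor list driven by a lowerCase flag can be sketched without the library as an ordered list of token-list transformations applied one after another. Everything below is hypothetical: `PreprocessorChainSketch`, `preprocessors`, and `apply` are illustrative names, and the three steps shown (whitespace splitting, optional lowercasing, punctuation separation) are an assumed subset, not the actual list returned by getPreprocessors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class PreprocessorChainSketch {

    /** Builds a hypothetical, ordered chain of token-list transformations. */
    public static List<UnaryOperator<List<String>>> preprocessors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> steps = new ArrayList<>();
        // Step 1: split every token on runs of whitespace.
        steps.add(tokens -> tokens.stream()
                .flatMap(t -> Arrays.stream(t.trim().split("\\s+")))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList()));
        // Step 2 (optional): lowercase each token.
        if (lowerCase) {
            steps.add(tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList()));
        }
        // Step 3: put each ASCII punctuation character into its own token.
        steps.add(tokens -> tokens.stream()
                .flatMap(t -> Arrays.stream(t.split("(?<=\\p{Punct})|(?=\\p{Punct})")))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList()));
        return steps;
    }

    /** Runs the text through every step in order. */
    public static List<String> apply(String text, boolean lowerCase) {
        List<String> tokens = List.of(text);
        for (UnaryOperator<List<String>> step : preprocessors(lowerCase)) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(apply("Hello,  World!", true)); // prints [hello, ,, world, !]
    }
}
```

Keeping the steps in a list makes the order explicit and lets a flag such as lowerCase add or skip a step without changing the code that runs the chain.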
-
-