public class BertDataParser
extends java.lang.Object
You can use this utility to parse a vocabulary JSON file into a Java array and dictionary, clean and tokenize sentences, and pad the resulting token lists.
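
The "array and dictionary" pair mentioned above can be pictured as an index-to-token list plus a token-to-index map. The following is a minimal standalone sketch of that idea, not the class's actual implementation: the name VocabularySketch, its constructor, and the fallback to an "[UNK]" entry for unknown tokens are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the two lookup structures; not the library's code.
class VocabularySketch {
    // The "array": position i holds the token with index i.
    private final List<String> idxToToken;
    // The "dictionary": maps each token back to its index.
    private final Map<String, Integer> tokenToIdx = new HashMap<>();

    VocabularySketch(List<String> tokensInIndexOrder) {
        this.idxToToken = new ArrayList<>(tokensInIndexOrder);
        for (int i = 0; i < idxToToken.size(); i++) {
            tokenToIdx.put(idxToToken.get(i), i);
        }
    }

    // Assumption: tokens missing from the vocabulary map to the "[UNK]" entry,
    // so the vocabulary passed to the constructor must contain "[UNK]".
    List<Integer> token2idx(List<String> tokens) {
        int unknown = tokenToIdx.get("[UNK]");
        List<Integer> indexes = new ArrayList<>();
        for (String token : tokens) {
            indexes.add(tokenToIdx.getOrDefault(token, unknown));
        }
        return indexes;
    }

    List<String> idx2token(List<Integer> indexes) {
        List<String> tokens = new ArrayList<>();
        for (int index : indexes) {
            tokens.add(idxToToken.get(index));
        }
        return tokens;
    }

    public static void main(String[] args) {
        VocabularySketch vocab =
                new VocabularySketch(Arrays.asList("[PAD]", "[UNK]", "hello", "world"));
        List<Integer> ids = vocab.token2idx(Arrays.asList("hello", "there", "world"));
        System.out.println(ids);                  // [2, 1, 3]
        System.out.println(vocab.idx2token(ids)); // [hello, [UNK], world]
    }
}
```

In this sketch the round trip is lossy for out-of-vocabulary words: "there" comes back as "[UNK]". Whether BertDataParser behaves the same way is not stated on this page.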
| Constructor and Description |
|---|
| BertDataParser() |
| Modifier and Type | Method and Description |
|---|---|
| static java.util.List<java.lang.String> | formTokens(java.util.List<java.lang.String> question, java.util.List<java.lang.String> answer, int seqLength) Forms tokens with separators that can be used for BERT. |
| static java.util.List<java.lang.Float> | getTokenTypes(java.util.List<java.lang.String> question, java.util.List<java.lang.String> answer, int seqLength) Forms the token type list [0000...1111...000], where question tokens are 0 and answer tokens are 1. |
| java.util.List<java.lang.String> | idx2token(java.util.List<java.lang.Integer> indexes) Converts indexes to tokens. |
| static <E> java.util.List<E> | pad(java.util.List<E> tokens, E padItem, int num) Pads the tokens to the required length. |
| static BertDataParser | parse(java.io.InputStream is) Parses the vocabulary from a JSON file. |
| java.util.List<java.lang.Integer> | token2idx(java.util.List<java.lang.String> tokens) Converts tokens to indexes. |
| static java.util.List<java.lang.String> | tokenizer(java.lang.String input) Tokenizes the input, splits on all kinds of whitespace, and separates end-of-sentence symbols. |
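
To make the tokenizer and pad descriptions concrete, here is a small standalone sketch of the behaviour they describe. It does not call BertDataParser. The class name TokenizePadSketch, the choice of `. , ? !` as end-of-sentence symbols, and the decision to leave over-long lists untruncated are assumptions, so the real methods may differ in these details.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative re-implementation under stated assumptions; not the library's code.
class TokenizePadSketch {

    // Splits on any run of whitespace and detaches one trailing symbol.
    // Assumption: the end-of-sentence symbols are . , ? and !
    static List<String> tokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for blank input
            }
            char last = word.charAt(word.length() - 1);
            if (word.length() > 1 && ".,?!".indexOf(last) >= 0) {
                tokens.add(word.substring(0, word.length() - 1));
                tokens.add(String.valueOf(last));
            } else {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Appends padItem until the list holds num elements.
    // Assumption: a list already longer than num is copied as-is, not truncated.
    static <E> List<E> pad(List<E> tokens, E padItem, int num) {
        List<E> padded = new ArrayList<>(tokens);
        while (padded.size() < num) {
            padded.add(padItem);
        }
        return padded;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenizer("Is  this\ta test?");
        System.out.println(tokens);                  // [Is, this, a, test, ?]
        System.out.println(pad(tokens, "[PAD]", 7)); // [Is, this, a, test, ?, [PAD], [PAD]]
    }
}
```

Padding to a fixed length is what lets every input share the same sequence length, which is why pad takes the total length after padding rather than the number of items to add.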
public static BertDataParser parse(java.io.InputStream is)
Parameters:
is - the InputStream for the vocab.json
Returns:
BertDataParser
Throws:
java.lang.IllegalStateException - if reading from the InputStream fails

public static java.util.List<java.lang.String> tokenizer(java.lang.String input)
Parameters:
input - the input string

public static <E> java.util.List<E> pad(java.util.List<E> tokens,
E padItem,
int num)
Type Parameters:
E - the type of the List
Parameters:
tokens - the input tokens
padItem - the item to pad with at the end
num - the total length after padding

public static java.util.List<java.lang.Float> getTokenTypes(java.util.List<java.lang.String> question,
java.util.List<java.lang.String> answer,
int seqLength)
Parameters:
question - the question tokens
answer - the answer tokens
seqLength - the sequence length

public static java.util.List<java.lang.String> formTokens(java.util.List<java.lang.String> question,
java.util.List<java.lang.String> answer,
int seqLength)
Parameters:
question - the question tokens
answer - the answer tokens
seqLength - the sequence length

public java.util.List<java.lang.Integer> token2idx(java.util.List<java.lang.String> tokens)
Parameters:
tokens - the input tokens

public java.util.List<java.lang.String> idx2token(java.util.List<java.lang.Integer> indexes)
Parameters:
indexes - the list of indexes
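
The relationship between formTokens and getTokenTypes is easier to see side by side. The sketch below is a standalone illustration, not the class's implementation. It assumes the conventional BERT layout of "[CLS]" question "[SEP]" answer "[SEP]" followed by "[PAD]" up to seqLength, and it assumes that "[CLS]" and the first "[SEP]" count toward the question segment while the final "[SEP]" counts toward the answer segment. The class name BertLayoutSketch and these placement choices are hypothetical and the real methods may assign the separators differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative layout under stated assumptions; not the library's code.
class BertLayoutSketch {

    // Assumed layout: [CLS] question [SEP] answer [SEP] [PAD]...
    static List<String> formTokens(List<String> question, List<String> answer, int seqLength) {
        List<String> tokens = new ArrayList<>();
        tokens.add("[CLS]");
        tokens.addAll(question);
        tokens.add("[SEP]");
        tokens.addAll(answer);
        tokens.add("[SEP]");
        while (tokens.size() < seqLength) {
            tokens.add("[PAD]");
        }
        return tokens;
    }

    // Produces the [0000...1111...000] pattern aligned with formTokens above.
    static List<Float> getTokenTypes(List<String> question, List<String> answer, int seqLength) {
        List<Float> types = new ArrayList<>();
        for (int i = 0; i < question.size() + 2; i++) {
            types.add(0f); // [CLS] + question + first [SEP] (assumed segment 0)
        }
        for (int i = 0; i < answer.size() + 1; i++) {
            types.add(1f); // answer + final [SEP] (assumed segment 1)
        }
        while (types.size() < seqLength) {
            types.add(0f); // padding positions
        }
        return types;
    }

    public static void main(String[] args) {
        List<String> question = Arrays.asList("what", "is", "it");
        List<String> answer = Arrays.asList("a", "test");
        System.out.println(formTokens(question, answer, 10));
        System.out.println(getTokenTypes(question, answer, 10));
    }
}
```

Both lists have length seqLength and line up position by position, so element i of the token type list says which segment token i belongs to.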