public abstract class TextDataset
extends ai.djl.training.dataset.RandomAccessDataset
TextDataset is an abstract dataset for natural language processing tasks in which
either the source or the target is text-based data.
The TextDataset fetches the data as String values, processes it as required, and
creates embeddings for the tokens. Embeddings can be either pre-trained or trained
on the go. A pre-trained TextEmbedding must be set in the TextDataset.Builder. If no
embedding is set, the dataset creates a TrainableWordEmbedding from the Vocabulary
created within the dataset.
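
The fallback rule described above (use the pre-trained embedding if one was set, otherwise create a trainable one from the vocabulary) can be sketched in plain Java. This is only an illustration under stated assumptions, not DJL's implementation: `Embedding`, `PretrainedEmbedding`, `TrainableEmbedding`, and `resolveEmbedding` are hypothetical stand-ins and do not exist in DJL under these names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-ins for the DJL embedding types; not the DJL API. */
final class EmbeddingFallbackSketch {

    /** Minimal embedding abstraction: reports whether it is trained on the go. */
    interface Embedding {
        boolean isTrainable();
    }

    /** Stands in for a pre-trained TextEmbedding supplied through the builder. */
    static final class PretrainedEmbedding implements Embedding {
        @Override
        public boolean isTrainable() {
            return false;
        }
    }

    /** Stands in for an embedding created from the dataset's own vocabulary. */
    static final class TrainableEmbedding implements Embedding {
        final List<String> vocabulary;

        TrainableEmbedding(List<String> vocabulary) {
            this.vocabulary = vocabulary;
        }

        @Override
        public boolean isTrainable() {
            return true;
        }
    }

    /** Returns the configured embedding if present, else a trainable one built from the vocabulary. */
    static Embedding resolveEmbedding(Optional<Embedding> configured, List<String> vocabulary) {
        return configured.orElseGet(() -> new TrainableEmbedding(vocabulary));
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("hello", "world");
        // No embedding configured: falls back to the trainable stand-in.
        System.out.println(resolveEmbedding(Optional.empty(), vocab).isTrainable());
        // Pre-trained embedding configured: it is used as-is.
        System.out.println(resolveEmbedding(Optional.of(new PretrainedEmbedding()), vocab).isTrainable());
    }
}
```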

| Modifier and Type | Class and Description |
|---|---|
| static class | `TextDataset.Builder<T extends TextDataset.Builder<T>>` Abstract Builder that helps build a TextDataset. |
| static class | `TextDataset.Sample` A class that stores TextDataset sample information. |
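
The type parameter `T extends TextDataset.Builder<T>` is the self-bounded generic pattern, which lets setters defined on an abstract builder return the concrete subclass type so that calls remain chainable. A minimal sketch of the pattern in isolation follows; `BaseBuilder`, `CsvBuilder`, `optLimit`, `optDelimiter`, and `self` are hypothetical names chosen for illustration and are not taken from this page.

```java
/** Hypothetical names throughout; only the generic shape mirrors Builder<T extends Builder<T>>. */
final class SelfTypedBuilderSketch {

    /** Base builder: setters return T so that subclass methods stay chainable. */
    abstract static class BaseBuilder<T extends BaseBuilder<T>> {
        int limit = Integer.MAX_VALUE;

        /** Returns this builder typed as the concrete subclass. */
        protected abstract T self();

        T optLimit(int limit) {
            this.limit = limit;
            return self();
        }
    }

    /** Concrete builder that binds T to itself. */
    static final class CsvBuilder extends BaseBuilder<CsvBuilder> {
        String delimiter = ",";

        @Override
        protected CsvBuilder self() {
            return this;
        }

        CsvBuilder optDelimiter(String delimiter) {
            this.delimiter = delimiter;
            return this;
        }
    }

    public static void main(String[] args) {
        // optLimit is declared on the base class but returns CsvBuilder,
        // so the subclass-only optDelimiter can be chained after it.
        CsvBuilder builder = new CsvBuilder().optLimit(10).optDelimiter(";");
        System.out.println(builder.limit + " " + builder.delimiter);
    }
}
```

Without the self-bound, `optLimit` would return the base type and the chained call to a subclass-only setter would not compile.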

| Modifier and Type | Field and Description |
|---|---|
| `protected ai.djl.ndarray.NDManager` | `manager` |
| `protected boolean` | `prepared` |
| `protected ai.djl.repository.Resource` | `resource` |
| `protected java.util.List<TextDataset.Sample>` | `samples` |
| `protected TextData` | `sourceTextData` |
| `protected TextData` | `targetTextData` |
| `protected ai.djl.training.dataset.Dataset.Usage` | `usage` |

| Constructor and Description |
|---|
| `TextDataset(TextDataset.Builder<?> builder)` Creates a new instance of RandomAccessDataset with the given necessary configurations. |

| Modifier and Type | Method and Description |
|---|---|
| `java.util.List<java.lang.String>` | `getProcessedText(long index, boolean source)` Gets the processed textual input. |
| `java.lang.String` | `getRawText(long index, boolean source)` Gets the raw textual input. |
| `java.util.List<TextDataset.Sample>` | `getSamples()` Returns a list of sample information. |
| `ai.djl.modality.nlp.embedding.TextEmbedding` | `getTextEmbedding(boolean source)` Gets the word embedding used while pre-processing the dataset. |
| `ai.djl.modality.nlp.Vocabulary` | `getVocabulary(boolean source)` Gets the SimpleVocabulary built while preprocessing the text data. |
| `protected void` | `preprocess(java.util.List<java.lang.String> newTextData, boolean source)` Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings. |

Methods inherited from class ai.djl.training.dataset.RandomAccessDataset:
availableSize, get, getData, getData, getData, getData, randomSplit, size, subDataset, toArray

protected TextData sourceTextData
protected TextData targetTextData
protected ai.djl.ndarray.NDManager manager
protected ai.djl.training.dataset.Dataset.Usage usage
protected ai.djl.repository.Resource resource
protected boolean prepared
protected java.util.List<TextDataset.Sample> samples

public TextDataset(TextDataset.Builder<?> builder)
Creates a new instance of RandomAccessDataset with the given necessary configurations.
Parameters:
builder - a builder with the necessary configurations

public ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding(boolean source)
Gets the word embedding used while pre-processing the dataset.
Parameters:
source - whether to get source or target text embedding

public ai.djl.modality.nlp.Vocabulary getVocabulary(boolean source)
Gets the SimpleVocabulary built while preprocessing the text data.
Parameters:
source - whether to get source or target vocabulary
Returns:
the SimpleVocabulary

public java.lang.String getRawText(long index, boolean source)
Gets the raw textual input.
Parameters:
index - the index of the text input
source - whether to get text from source or target

public java.util.List<java.lang.String> getProcessedText(long index, boolean source)
Gets the processed textual input.
Parameters:
index - the index of the text input
source - whether to get text from source or target

public java.util.List<TextDataset.Sample> getSamples()
Returns a list of sample information.

protected void preprocess(java.util.List<java.lang.String> newTextData, boolean source) throws ai.djl.modality.nlp.embedding.EmbeddingException
Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
Parameters:
newTextData - list of all unprocessed sentences in the dataset
source - whether the text data provided is source or target
Throws:
ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input
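
The first steps that preprocess is described as performing (tokenising, applying text processors, creating a vocabulary) can be illustrated with a small JDK-only sketch. It makes several assumptions that do not come from this page: whitespace tokenisation, lower-casing as the only text processor, and vocabulary indices assigned in first-seen order. `PreprocessSketch`, `tokenize`, and `buildVocabulary` are hypothetical names, and the embedding step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** JDK-only illustration of tokenise -> process -> build vocabulary; not DJL code. */
final class PreprocessSketch {

    /** Splits on whitespace and lower-cases each token (stand-in for tokenizer plus processors). */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Assigns each distinct token an index in first-seen order. */
    static Map<String, Integer> buildVocabulary(List<List<String>> tokenizedSentences) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (List<String> sentence : tokenizedSentences) {
            for (String token : sentence) {
                // The index is the vocabulary size at the moment the token is first seen.
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        List<String> rawSentences = Arrays.asList("Hello world", "hello DJL");
        List<List<String>> processed = new ArrayList<>();
        for (String sentence : rawSentences) {
            processed.add(tokenize(sentence));
        }
        System.out.println(processed);
        System.out.println(buildVocabulary(processed));
    }
}
```

In this sketch the raw sentences correspond to what getRawText would return and the token lists to what getProcessedText would return, although the real processing steps depend on how the dataset is configured.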