Package ai.djl.basicdataset.nlp
Class TextDataset
- java.lang.Object
-
- ai.djl.training.dataset.RandomAccessDataset
-
- ai.djl.basicdataset.nlp.TextDataset
-
- All Implemented Interfaces:
ai.djl.training.dataset.Dataset
- Direct Known Subclasses:
GoEmotions, PennTreebankText, StanfordMovieReview, StanfordQuestionAnsweringDataset, TatoebaEnglishFrenchDataset, UniversalDependenciesEnglishEWT
public abstract class TextDataset extends ai.djl.training.dataset.RandomAccessDataset
TextDataset is an abstract dataset for natural language processing tasks in which the source, the target, or both are text-based data.
The TextDataset fetches the data in the form of String, processes the data as required, and creates embeddings for the tokens. Embeddings can be either pre-trained or trained on the go. A pre-trained TextEmbedding must be set in the TextDataset.Builder. If no embeddings are set, the dataset creates a TrainableWordEmbedding from the Vocabulary created within the dataset.
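The fallback described above (use a pre-trained embedding when one is configured, otherwise build a trainable one from the vocabulary) can be pictured with a small plain-Java sketch. It is an illustration under stated assumptions, not DJL code: `TokenEmbedding`, `trainableFrom`, and `resolve` are hypothetical names, and the random initialisation is only one plausible choice.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative only: none of these names are part of DJL. */
class EmbeddingFallbackSketch {

    /** Hypothetical stand-in for a text embedding: maps a token to a vector. */
    interface TokenEmbedding {
        float[] embed(String token);
    }

    /** Builds a lookup table with one randomly initialised row per vocabulary entry. */
    static TokenEmbedding trainableFrom(List<String> vocabulary, int dim, long seed) {
        Random random = new Random(seed);
        Map<String, float[]> table = new LinkedHashMap<>();
        for (String token : vocabulary) {
            float[] row = new float[dim];
            for (int i = 0; i < dim; i++) {
                row[i] = (random.nextFloat() - 0.5f) * 0.1f; // small initial weights
            }
            table.put(token, row);
        }
        float[] unknown = new float[dim]; // zero vector for out-of-vocabulary tokens
        return token -> table.getOrDefault(token, unknown);
    }

    /** Uses the pre-trained embedding when one was configured, else builds a trainable one. */
    static TokenEmbedding resolve(TokenEmbedding preTrained, List<String> vocabulary, int dim) {
        return preTrained != null ? preTrained : trainableFrom(vocabulary, dim, 42L);
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("the", "cat", "sat");
        // No pre-trained embedding supplied, so a trainable table is created.
        TokenEmbedding embedding = resolve(null, vocabulary, 4);
        System.out.println(embedding.embed("cat").length); // prints 4
    }
}
```

In this sketch the trainable table is sized by the vocabulary, which mirrors why the vocabulary has to exist before such an embedding can be created.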
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | TextDataset.Builder<T extends TextDataset.Builder<T>> | Abstract Builder that helps build a TextDataset.
static class | TextDataset.Sample | A class that stores TextDataset sample information.
-
Field Summary
Fields
Modifier and Type | Field
protected ai.djl.ndarray.NDManager | manager
protected ai.djl.repository.MRL | mrl
protected boolean | prepared
protected java.util.List<TextDataset.Sample> | samples
protected TextData | sourceTextData
protected TextData | targetTextData
protected ai.djl.training.dataset.Dataset.Usage | usage
-
Constructor Summary
Constructors
Constructor | Description
TextDataset(TextDataset.Builder<?> builder) | Creates a new instance of RandomAccessDataset with the given necessary configurations.
-
Method Summary
Methods
Modifier and Type | Method | Description
java.util.List<java.lang.String> | getProcessedText(long index, boolean source) | Gets the processed textual input.
java.lang.String | getRawText(long index, boolean source) | Gets the raw textual input.
java.util.List<TextDataset.Sample> | getSamples() | Returns a list of sample information.
ai.djl.modality.nlp.embedding.TextEmbedding | getTextEmbedding(boolean source) | Gets the word embedding used while pre-processing the dataset.
ai.djl.modality.nlp.Vocabulary | getVocabulary(boolean source) | Gets the DefaultVocabulary built while preprocessing the text data.
protected void | preprocess(java.util.List<java.lang.String> newTextData, boolean source) | Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
-
Methods inherited from class ai.djl.training.dataset.RandomAccessDataset
availableSize, get, getData, getData, getData, getData, newSubDataset, newSubDataset, randomSplit, size, subDataset, subDataset, subDataset, subDataset, toArray
-
-
-
-
Field Detail
-
sourceTextData
protected TextData sourceTextData
-
targetTextData
protected TextData targetTextData
-
manager
protected ai.djl.ndarray.NDManager manager
-
usage
protected ai.djl.training.dataset.Dataset.Usage usage
-
mrl
protected ai.djl.repository.MRL mrl
-
prepared
protected boolean prepared
-
samples
protected java.util.List<TextDataset.Sample> samples
-
-
Constructor Detail
-
TextDataset
public TextDataset(TextDataset.Builder<?> builder)
Creates a new instance of RandomAccessDataset with the given necessary configurations.
- Parameters:
builder - a builder with the necessary configurations
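The constructor takes a `TextDataset.Builder<?>`, and the builder itself is declared as `Builder<T extends Builder<T>>`. That self-referential type parameter lets setters declared on a base builder return the concrete builder type, so calls can be chained without casts. The sketch below shows the general pattern in plain Java; `BaseBuilder`, `BaseDataset`, `ReviewDataset`, `setBatchSize`, and `optLowerCase` are hypothetical names chosen for illustration and are not claimed to match DJL's classes or options.

```java
/** Illustrative only: a minimal self-typed builder, not DJL's actual classes. */
class SelfTypedBuilderSketch {

    /** Base builder; T is the concrete builder type so setters can return it. */
    abstract static class BaseBuilder<T extends BaseBuilder<T>> {
        int batchSize = 1;

        /** Returns this builder as its concrete type. */
        abstract T self();

        T setBatchSize(int batchSize) {
            this.batchSize = batchSize;
            return self();
        }
    }

    /** Base dataset whose constructor accepts any builder subtype via a wildcard. */
    abstract static class BaseDataset {
        final int batchSize;

        BaseDataset(BaseBuilder<?> builder) {
            this.batchSize = builder.batchSize;
        }
    }

    /** Hypothetical concrete dataset with its own builder option. */
    static final class ReviewDataset extends BaseDataset {
        final boolean lowerCase;

        private ReviewDataset(Builder builder) {
            super(builder);
            this.lowerCase = builder.lowerCase;
        }

        static Builder builder() {
            return new Builder();
        }

        static final class Builder extends BaseBuilder<Builder> {
            boolean lowerCase;

            @Override
            Builder self() {
                return this;
            }

            Builder optLowerCase(boolean lowerCase) {
                this.lowerCase = lowerCase;
                return this;
            }

            ReviewDataset build() {
                return new ReviewDataset(this);
            }
        }
    }

    public static void main(String[] args) {
        // setBatchSize is declared on the base builder but still returns the
        // concrete Builder, so the subclass-only option can follow it in the chain.
        ReviewDataset dataset = ReviewDataset.builder()
                .setBatchSize(32)
                .optLowerCase(true)
                .build();
        System.out.println(dataset.batchSize + " " + dataset.lowerCase); // prints "32 true"
    }
}
```

Without the self type, `setBatchSize` would return the base builder and `optLowerCase` would not be reachable in the same chain.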
-
-
Method Detail
-
getTextEmbedding
public ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding(boolean source)
Gets the word embedding used while pre-processing the dataset. This method must be called after preprocess has been called on this instance.
- Parameters:
source - whether to get source or target text embedding
- Returns:
the text embedding
-
getVocabulary
public ai.djl.modality.nlp.Vocabulary getVocabulary(boolean source)
Gets the DefaultVocabulary built while preprocessing the text data.
- Parameters:
source - whether to get source or target vocabulary
- Returns:
the DefaultVocabulary
-
getRawText
public java.lang.String getRawText(long index, boolean source)
Gets the raw textual input.
- Parameters:
index - the index of the text input
source - whether to get text from source or target
- Returns:
the raw text
-
getProcessedText
public java.util.List<java.lang.String> getProcessedText(long index, boolean source)
Gets the processed textual input.
- Parameters:
index - the index of the text input
source - whether to get text from source or target
- Returns:
the processed text
-
getSamples
public java.util.List<TextDataset.Sample> getSamples()
Returns a list of sample information.
- Returns:
a list of sample information
-
preprocess
protected void preprocess(java.util.List<java.lang.String> newTextData, boolean source) throws ai.djl.modality.nlp.embedding.EmbeddingException
Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
- Parameters:
newTextData - list of all unprocessed sentences in the dataset
source - whether the text data provided is source or target
- Throws:
ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input
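As a rough picture of the first steps named above (tokenising, applying a text processor, creating a vocabulary), here is a minimal plain-Java outline. It is a sketch under assumptions rather than DJL's implementation: whitespace splitting, lower-casing as the only processor, the `<unk>` token, and the helper names `tokenize`, `buildVocabulary`, and `toIndices` are all hypothetical, and the embedding step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: a plain-Java outline of the steps, not DJL's implementation. */
class PreprocessSketch {

    static final String UNKNOWN = "<unk>";

    /** Splits on whitespace and lower-cases each token (one simple text processor). */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Assigns an index to each distinct token in first-seen order; index 0 is reserved. */
    static Map<String, Integer> buildVocabulary(List<List<String>> corpus) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        vocabulary.put(UNKNOWN, 0);
        for (List<String> tokens : corpus) {
            for (String token : tokens) {
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Maps tokens to vocabulary indices, falling back to the unknown index 0. */
    static int[] toIndices(List<String> tokens, Map<String, Integer> vocabulary) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = vocabulary.getOrDefault(tokens.get(i), 0);
        }
        return indices;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("The cat sat", "the dog sat down");
        List<List<String>> processed = new ArrayList<>();
        for (String sentence : raw) {
            processed.add(tokenize(sentence));
        }
        Map<String, Integer> vocabulary = buildVocabulary(processed);
        // prints {<unk>=0, the=1, cat=2, sat=3, dog=4, down=5}
        System.out.println(vocabulary);
        // "bird" is not in the vocabulary, so it maps to 0: prints [1, 0, 3]
        System.out.println(Arrays.toString(toIndices(tokenize("The bird sat"), vocabulary)));
    }
}
```

The raw sentences and the token lists in this outline correspond loosely to what getRawText and getProcessedText expose for a given index.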
-
-