Class TextDataset

  • All Implemented Interfaces:
    ai.djl.training.dataset.Dataset
    Direct Known Subclasses:
    PennTreebankText, StanfordMovieReview, StanfordQuestionAnsweringDataset, TatoebaEnglishFrenchDataset

    public abstract class TextDataset
    extends ai.djl.training.dataset.RandomAccessDataset
    TextDataset is an abstract dataset for natural language processing tasks in which either the source or the target is text-based data.

    The TextDataset fetches the data as Strings, processes it as required, and creates embeddings for the tokens. Embeddings can be either pre-trained or trained on the fly. A pre-trained TextEmbedding must be set in the TextDataset.Builder. If no embedding is set, the dataset creates a TrainableWordEmbedding from the Vocabulary built within the dataset.
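    The flow described above (raw Strings, then tokens, then a vocabulary that an embedding can index into) can be pictured with the plain-Java sketch below. It uses only the JDK and is not DJL's implementation: the class TextPipelineSketch, its methods tokenize and buildVocabulary, and the whitespace/lower-case processing are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names come from DJL.
final class TextPipelineSketch {

    /** Lower-cases a sentence and splits it on whitespace (a stand-in for text processors). */
    public static List<String> tokenize(String sentence) {
        String cleaned = sentence.trim().toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return tokens; // no tokens for blank input
        }
        for (String token : cleaned.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    /** Assigns each distinct token an index in order of first appearance. */
    public static Map<String, Integer> buildVocabulary(List<List<String>> tokenized) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (List<String> sentence : tokenized) {
            for (String token : sentence) {
                // The index is the vocabulary size at the time the token is first seen.
                vocabulary.putIfAbsent(token, vocabulary.size());
            }
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        List<List<String>> tokenized = new ArrayList<>();
        tokenized.add(tokenize("The cat sat"));
        tokenized.add(tokenize("the dog sat"));
        System.out.println(buildVocabulary(tokenized));
    }
}
```

    A trainable embedding would then hold one vector per vocabulary index, which is why the vocabulary has to exist before the embedding is created.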

    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  TextDataset.Builder<T extends TextDataset.Builder<T>>
      Abstract Builder that helps build a TextDataset.
      static class  TextDataset.Sample
      A class that stores TextDataset sample information.
      • Nested classes/interfaces inherited from class ai.djl.training.dataset.RandomAccessDataset

        ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T extends ai.djl.training.dataset.RandomAccessDataset.BaseBuilder<T>>
      • Nested classes/interfaces inherited from interface ai.djl.training.dataset.Dataset

        ai.djl.training.dataset.Dataset.Usage
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected ai.djl.ndarray.NDManager manager  
      protected ai.djl.repository.MRL mrl  
      protected boolean prepared  
      protected java.util.List<TextDataset.Sample> samples  
      protected TextData sourceTextData  
      protected TextData targetTextData  
      protected ai.djl.training.dataset.Dataset.Usage usage  
      • Fields inherited from class ai.djl.training.dataset.RandomAccessDataset

        dataBatchifier, device, labelBatchifier, limit, pipeline, prefetchNumber, sampler, targetPipeline
    • Constructor Summary

      Constructors 
      Constructor Description
      TextDataset​(TextDataset.Builder<?> builder)
      Creates a new instance of TextDataset with the given necessary configurations.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      java.util.List<java.lang.String> getProcessedText​(long index, boolean source)
      Gets the processed textual input.
      java.lang.String getRawText​(long index, boolean source)
      Gets the raw textual input.
      java.util.List<TextDataset.Sample> getSamples()
      Returns a list of sample information.
      ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding​(boolean source)
      Gets the word embedding used while pre-processing the dataset.
      ai.djl.modality.nlp.Vocabulary getVocabulary​(boolean source)
      Gets the DefaultVocabulary built while preprocessing the text data.
      protected void preprocess​(java.util.List<java.lang.String> newTextData, boolean source)
      Performs pre-processing steps on text data, such as tokenising, applying TextProcessors, and creating the vocabulary and word embeddings.
      • Methods inherited from class ai.djl.training.dataset.RandomAccessDataset

        availableSize, get, getData, getData, getData, getData, randomSplit, size, subDataset, toArray
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface ai.djl.training.dataset.Dataset

        prepare, prepare
    • Field Detail

      • sourceTextData

        protected TextData sourceTextData
      • targetTextData

        protected TextData targetTextData
      • manager

        protected ai.djl.ndarray.NDManager manager
      • usage

        protected ai.djl.training.dataset.Dataset.Usage usage
      • mrl

        protected ai.djl.repository.MRL mrl
      • prepared

        protected boolean prepared
    • Constructor Detail

      • TextDataset

        public TextDataset​(TextDataset.Builder<?> builder)
        Creates a new instance of TextDataset with the given necessary configurations.
        Parameters:
        builder - a builder with the necessary configurations
    • Method Detail

      • getTextEmbedding

        public ai.djl.modality.nlp.embedding.TextEmbedding getTextEmbedding​(boolean source)
        Gets the word embedding used while pre-processing the dataset. This method must be called after preprocess has been called on this instance.
        Parameters:
        source - whether to get source or target text embedding
        Returns:
        the text embedding
      • getVocabulary

        public ai.djl.modality.nlp.Vocabulary getVocabulary​(boolean source)
        Gets the DefaultVocabulary built while preprocessing the text data.
        Parameters:
        source - whether to get source or target vocabulary
        Returns:
        the DefaultVocabulary
      • getRawText

        public java.lang.String getRawText​(long index,
                                           boolean source)
        Gets the raw textual input.
        Parameters:
        index - the index of the text input
        source - whether to get text from source or target
        Returns:
        the raw text
      • getProcessedText

        public java.util.List<java.lang.String> getProcessedText​(long index,
                                                                 boolean source)
        Gets the processed textual input.
        Parameters:
        index - the index of the text input
        source - whether to get text from source or target
        Returns:
        the processed text
      • getSamples

        public java.util.List<TextDataset.Sample> getSamples()
        Returns a list of sample information.
        Returns:
        a list of sample information
      • preprocess

        protected void preprocess​(java.util.List<java.lang.String> newTextData,
                                  boolean source)
                           throws ai.djl.modality.nlp.embedding.EmbeddingException
        Performs pre-processing steps on text data, such as tokenising, applying TextProcessors, and creating the vocabulary and word embeddings.
        Parameters:
        newTextData - list of all unprocessed sentences in the dataset
        source - whether the text data provided is source or target
        Throws:
        ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input
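    The accessors above share a boolean source parameter that selects the source side or the target side of each sample. The JDK-only sketch below shows one way such a flag can work. It is an assumption-laden illustration, not DJL code: the class ParallelTextSketch and its fields are hypothetical, and the "processing" is reduced to lower-casing and whitespace splitting.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this class is not part of DJL.
final class ParallelTextSketch {
    private final List<String> sourceRaw;
    private final List<String> targetRaw;

    ParallelTextSketch(List<String> sourceRaw, List<String> targetRaw) {
        if (sourceRaw.size() != targetRaw.size()) {
            throw new IllegalArgumentException("source and target must have the same number of samples");
        }
        this.sourceRaw = sourceRaw;
        this.targetRaw = targetRaw;
    }

    /** Returns the unprocessed text at the given index from the selected side. */
    public String getRawText(long index, boolean source) {
        List<String> side = source ? sourceRaw : targetRaw;
        return side.get(Math.toIntExact(index));
    }

    /** Returns the tokens of the selected side after a toy processing step. */
    public List<String> getProcessedText(long index, boolean source) {
        String lowered = getRawText(index, source).toLowerCase(Locale.ROOT);
        return Arrays.asList(lowered.split("\\s+"));
    }

    public static void main(String[] args) {
        ParallelTextSketch data = new ParallelTextSketch(
                Arrays.asList("Go away"), Arrays.asList("Va t'en"));
        System.out.println(data.getRawText(0, true));        // source side, raw
        System.out.println(data.getProcessedText(0, false)); // target side, tokens
    }
}
```

    Keeping both sides behind one index means a single sample position always refers to a matched source/target pair.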