Package ai.djl.basicdataset.nlp
Class StanfordQuestionAnsweringDataset
- java.lang.Object
  - ai.djl.training.dataset.RandomAccessDataset
    - ai.djl.basicdataset.nlp.TextDataset
      - ai.djl.basicdataset.nlp.StanfordQuestionAnsweringDataset
All Implemented Interfaces:
RawDataset<java.lang.Object>, ai.djl.training.dataset.Dataset
public class StanfordQuestionAnsweringDataset extends TextDataset implements RawDataset<java.lang.Object>
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text, or span, from the corresponding reading passage; some questions may be unanswerable.
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | StanfordQuestionAnsweringDataset.Builder | A builder for a StanfordQuestionAnsweringDataset.
-
Nested classes/interfaces inherited from class ai.djl.basicdataset.nlp.TextDataset
TextDataset.Sample
-
-
Field Summary
-
Fields inherited from class ai.djl.basicdataset.nlp.TextDataset
manager, mrl, prepared, samples, sourceTextData, targetTextData, usage
-
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | StanfordQuestionAnsweringDataset(StanfordQuestionAnsweringDataset.Builder builder) | Creates a new instance of StanfordQuestionAnsweringDataset.
-
Method Summary
Modifier and Type | Method | Description
protected long | availableSize() | Returns the number of records available to be read in this Dataset.
static StanfordQuestionAnsweringDataset.Builder | builder() | Creates a new builder to build a StanfordQuestionAnsweringDataset.
ai.djl.training.dataset.Record | get(ai.djl.ndarray.NDManager manager, long index) | Gets the Record for the given index from the dataset.
java.lang.Object | getData() | Gets data from the SQuAD dataset.
void | prepare(ai.djl.util.Progress progress) | Prepares the dataset for use with tracked progress.
protected void | preprocess(java.util.List<java.lang.String> newTextData, boolean source) | Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings.
-
Methods inherited from class ai.djl.basicdataset.nlp.TextDataset
getProcessedText, getRawText, getSamples, getTextEmbedding, getVocabulary
-
Methods inherited from class ai.djl.training.dataset.RandomAccessDataset
getData, getData, getData, getData, randomSplit, size, subDataset, toArray
-
Constructor Detail
-
StanfordQuestionAnsweringDataset
protected StanfordQuestionAnsweringDataset(StanfordQuestionAnsweringDataset.Builder builder)
Creates a new instance of StanfordQuestionAnsweringDataset.
Parameters:
builder - the builder object to build from
-
-
Method Detail
-
builder
public static StanfordQuestionAnsweringDataset.Builder builder()
Creates a new builder to build a StanfordQuestionAnsweringDataset.
Returns:
a new builder
-
prepare
public void prepare(ai.djl.util.Progress progress) throws java.io.IOException, ai.djl.modality.nlp.embedding.EmbeddingException
Prepares the dataset for use with tracked progress. This method parses the JSON file, adds each question, context, and title to sourceTextData, and adds the answers to targetTextData. Both lists are then preprocessed.
Specified by:
prepare in interface ai.djl.training.dataset.Dataset
Parameters:
progress - the progress tracker
Throws:
java.io.IOException - for various exceptions depending on the dataset
ai.djl.modality.nlp.embedding.EmbeddingException - if there are exceptions during the embedding process
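The bookkeeping described for prepare can be pictured with a small, self-contained sketch. It is not DJL's implementation: the class SquadFlattenSketch, its QuestionRef type, and the addSource and addQuestion methods are hypothetical names, and the sketch assumes that each title and context is stored once and shared by every question that refers to it. It illustrates why the number of questions can differ from the lengths of the source and target text lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not a DJL class. */
final class SquadFlattenSketch {

    /** Points one question at its title, context, and answers by list index. */
    static final class QuestionRef {
        final int titleIndex;
        final int contextIndex;
        final int questionIndex;
        final List<Integer> answerIndices;

        QuestionRef(int titleIndex, int contextIndex, int questionIndex,
                List<Integer> answerIndices) {
            this.titleIndex = titleIndex;
            this.contextIndex = contextIndex;
            this.questionIndex = questionIndex;
            this.answerIndices = answerIndices;
        }
    }

    final List<String> sourceTextData = new ArrayList<>();
    final List<String> targetTextData = new ArrayList<>();
    final List<QuestionRef> questionRefs = new ArrayList<>();

    /** Stores a title, context, or question and returns its index in the source list. */
    int addSource(String text) {
        sourceTextData.add(text);
        return sourceTextData.size() - 1;
    }

    /** Stores a question and its answers, then records where each piece lives. */
    void addQuestion(int titleIndex, int contextIndex, String question, List<String> answers) {
        int questionIndex = addSource(question);
        List<Integer> answerIndices = new ArrayList<>();
        for (String answer : answers) {
            targetTextData.add(answer);
            answerIndices.add(targetTextData.size() - 1);
        }
        questionRefs.add(new QuestionRef(titleIndex, contextIndex, questionIndex, answerIndices));
    }

    public static void main(String[] args) {
        SquadFlattenSketch sketch = new SquadFlattenSketch();
        int title = sketch.addSource("France");
        int context = sketch.addSource("Paris is the capital of France.");
        sketch.addQuestion(title, context, "What is the capital of France?",
                List.of("Paris", "Paris"));
        sketch.addQuestion(title, context, "Who designed the city?", List.of());
        System.out.println("source strings: " + sketch.sourceTextData.size());
        System.out.println("target strings: " + sketch.targetTextData.size());
        System.out.println("questions: " + sketch.questionRefs.size());
    }
}
```

With one title, one context, and two questions (one with two answers, one unanswerable), the source list holds four strings, the target list holds two, and there are two question entries. The sketch needs Java 9 or later for List.of.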
-
get
public ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index)
Gets the Record for the given index from the dataset.
Specified by:
get in class ai.djl.training.dataset.RandomAccessDataset
Parameters:
manager - the manager used to create the arrays
index - the index of the requested data item
Returns:
a Record that contains the data and label of the requested data item. The data NDList contains three NDArrays representing the embedded title, context and question, which are named accordingly. The label NDList contains one NDArray for each embedded answer.
-
availableSize
protected long availableSize()
Returns the number of records available to be read in this Dataset. In this implementation, the number of available records is the size of questionInfoList.
Specified by:
availableSize in class ai.djl.training.dataset.RandomAccessDataset
Returns:
the number of records available to be read in this Dataset
-
getData
public java.lang.Object getData() throws java.io.IOException
Gets data from the SQuAD dataset. This method directly returns the whole dataset as an object.
Specified by:
getData in interface RawDataset<java.lang.Object>
Returns:
an object of class Object with the structure of the JSON, e.g. Map<String, List<Map<...>>>
Throws:
java.io.IOException - when an IO operation fails in loading a resource
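Because getData returns a plain Object, callers have to cast their way through the nested maps and lists. The sketch below assumes the returned object mirrors the published SQuAD JSON layout (a top-level data list of articles, each with paragraphs, each with qas). The sample structure and the class name SquadRawDataWalk are hypothetical stand-ins, and no DJL call is made; in real use the argument would be the value returned by getData.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; walks a SQuAD-shaped structure of maps and lists. */
final class SquadRawDataWalk {

    /** Collects every question string, assuming the data -> paragraphs -> qas layout. */
    @SuppressWarnings("unchecked")
    static List<String> collectQuestions(Object raw) {
        List<String> questions = new ArrayList<>();
        Map<String, Object> root = (Map<String, Object>) raw;
        List<Map<String, Object>> articles = (List<Map<String, Object>>) root.get("data");
        for (Map<String, Object> article : articles) {
            List<Map<String, Object>> paragraphs =
                    (List<Map<String, Object>>) article.get("paragraphs");
            for (Map<String, Object> paragraph : paragraphs) {
                List<Map<String, Object>> qas =
                        (List<Map<String, Object>>) paragraph.get("qas");
                for (Map<String, Object> qa : qas) {
                    questions.add((String) qa.get("question"));
                }
            }
        }
        return questions;
    }

    /** Hypothetical stand-in for the object that getData() is assumed to return. */
    static Object sampleData() {
        Map<String, Object> answer = Map.of("text", "Paris", "answer_start", 0);
        Map<String, Object> answerable = Map.of(
                "id", "q1",
                "question", "What is the capital of France?",
                "answers", List.of(answer));
        Map<String, Object> unanswerable = Map.of(
                "id", "q2",
                "question", "Who designed the city?",
                "answers", List.of(),
                "is_impossible", true);
        Map<String, Object> paragraph = Map.of(
                "context", "Paris is the capital of France.",
                "qas", List.of(answerable, unanswerable));
        Map<String, Object> article =
                Map.of("title", "France", "paragraphs", List.of(paragraph));
        return Map.of("version", "v2.0", "data", List.of(article));
    }

    public static void main(String[] args) {
        for (String question : collectQuestions(sampleData())) {
            System.out.println(question);
        }
    }
}
```

The unchecked casts are the price of the Object return type. If the real structure differs from the assumed layout, the casts or the key lookups fail at run time, so production code would check types and null values before descending. The sketch needs Java 9 or later for Map.of and List.of.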
-
preprocess
protected void preprocess(java.util.List<java.lang.String> newTextData, boolean source) throws ai.djl.modality.nlp.embedding.EmbeddingException
Performs pre-processing steps on text data such as tokenising, applying TextProcessors, creating vocabulary, and word embeddings. Because the number of records in this dataset is not equal to the length of sourceTextData or targetTextData, the limit needs to be handled in this method.
Overrides:
preprocess in class TextDataset
Parameters:
newTextData - list of all unprocessed sentences in the dataset
source - whether the text data provided is source or target
Throws:
ai.djl.modality.nlp.embedding.EmbeddingException - if there is an error while embedding input