public final class DictionaryVectorizer extends AbstractJob implements Vectorizer
This is a dictionary-based Vectorizer. The input is a set of documents in SequenceFile format, where each record has a Text key containing the unique document identifier and a StringTuple value containing the tokenized document. You may use DocumentProcessor to tokenize the documents.

| Modifier and Type | Field and Description |
|---|---|
| `static int` | `DEFAULT_MIN_SUPPORT` |
| `static String` | `DICTIONARY_FILE` |
| `static String` | `DOCUMENT_VECTOR_OUTPUT_FOLDER` |
| `static String` | `MAX_NGRAMS` |
| `static String` | `MIN_SUPPORT` |
Fields inherited from class org.apache.mahout.common.AbstractJob: `argMap, inputFile, inputPath, outputFile, outputPath, tempPath`

| Modifier and Type | Method and Description |
|---|---|
| `static void` | `createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, String tfVectorsFolderName, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors)` Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. |
| `void` | `createVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, VectorizerConfig config)` |
| `static void` | `main(String[] args)` |
| `int` | `run(String[] args)` |
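As a sketch of driving this API directly (the paths and tuning values below are illustrative, not defaults, and the snippet assumes a Hadoop/Mahout classpath plus tokenized input such as DocumentProcessor produces):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.DictionaryVectorizer;

public class TfVectorDriver {
  public static void main(String[] args) throws Exception {
    // Tokenized input: SequenceFile<Text, StringTuple>, e.g. from DocumentProcessor.
    Path input = new Path("tokenized-documents");   // illustrative path
    Path output = new Path("vectors");              // illustrative path

    DictionaryVectorizer.createTermFrequencyVectors(
        input,
        output,
        DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, // folder for the final tf vectors
        new Configuration(),
        2,      // minSupport: drop features seen fewer than 2 times in the corpus
        1,      // maxNGramSize: unigrams only
        0.0f,   // minLLRValue: LLR pruning only applies when maxNGramSize > 1
        2.0f,   // normPower: compute the L_2 norm
        false,  // logNormalize
        1,      // numReducers
        100,    // chunkSizeInMegabytes for the feature => id dictionary chunks
        false,  // sequentialAccess
        true);  // namedVectors
  }
}
```

Because `createTermFrequencyVectors` launches map/reduce jobs, this only runs against a configured Hadoop environment; it is a shape-of-the-call sketch rather than a standalone program.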
Methods inherited from class org.apache.mahout.common.AbstractJob: `addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase`

public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER
public static final String MIN_SUPPORT
public static final String MAX_NGRAMS
public static final int DEFAULT_MIN_SUPPORT
public static final String DICTIONARY_FILE
public void createVectors(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
VectorizerConfig config)
throws IOException,
ClassNotFoundException,
InterruptedException
Specified by: `createVectors` in interface `Vectorizer`
Throws: `IOException`, `ClassNotFoundException`, `InterruptedException`

public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
String tfVectorsFolderName,
org.apache.hadoop.conf.Configuration baseConf,
int minSupport,
int maxNGramSize,
float minLLRValue,
float normPower,
boolean logNormalize,
int numReducers,
int chunkSizeInMegabytes,
boolean sequentialAccess,
boolean namedVectors)
throws IOException,
InterruptedException,
ClassNotFoundException
Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. This fixes the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduce passes.

Parameters:
- `input` - input directory of the documents in SequenceFile format
- `output` - output directory where RandomAccessSparseVectors of the documents are generated
- `tfVectorsFolderName` - the name of the folder in which the final output vectors will be stored
- `baseConf` - job configuration
- `minSupport` - the minimum frequency a feature must have in the entire corpus to be considered for inclusion in the sparse vector
- `maxNGramSize` - 1 = unigrams; 2 = unigrams and bigrams; 3 = unigrams, bigrams, and trigrams
- `minLLRValue` - minimum log-likelihood-ratio value used to prune n-grams
- `normPower` - the L_p norm to be computed
- `logNormalize` - whether to use log normalization
- `numReducers` - the number of reducers to use
- `chunkSizeInMegabytes` - the size in MB of the feature => id chunk to be kept in memory at each node during the map/reduce stage. It is recommended you calculate this based on the number of cores and the free memory available per node. Say you have 2 cores and around 1 GB of memory to spare; we recommend a chunk size of around 400-500 MB so that two simultaneous reducers can create partial vectors without thrashing the system through increased swapping
- `sequentialAccess` - whether the output vectors should be stored in sequential-access form
- `namedVectors` - whether the output vectors should be named with their document identifiers

Throws: `IOException`, `InterruptedException`, `ClassNotFoundException`

public int run(String[] args) throws Exception

Specified by: `run` in interface `org.apache.hadoop.util.Tool`
Throws: `Exception`

Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.
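The sizing guidance for `chunkSizeInMegabytes` amounts to dividing a node's spare memory across the reducers that run concurrently on it. A small illustrative helper (not part of the Mahout API; the 20% headroom factor is an assumption) that applies that rule of thumb:

```java
// Illustrative helper, not part of Mahout: choose a dictionary-chunk size
// by splitting a node's spare memory across its concurrent reducers.
public class ChunkSizeHelper {

    /** Assumed heuristic: one reducer per core, keeping ~20% of memory free. */
    static int chunkSizeMb(int coresPerNode, int spareMemoryMb) {
        return (int) (spareMemoryMb * 0.8 / coresPerNode);
    }

    public static void main(String[] args) {
        // 2 cores and ~1 GB of spare memory give a chunk of roughly 400 MB,
        // in line with the 400-500 MB recommendation in the parameter docs.
        System.out.println(chunkSizeMb(2, 1024));
    }
}
```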