public final class TFIDFConverter extends Object

This class converts a set of input vectors with term frequencies to TfIdf format. It expects an input SequenceFile with a WritableComparable key (the document id) and a VectorWritable value containing the term frequency vector. The conversion uses multiple map/reduce jobs to convert the vectors to TfIdf format.

| Modifier and Type | Field and Description |
|---|---|
| static String | FEATURE_COUNT |
| static String | FREQUENCY_FILE |
| static String | MAX_DF |
| static String | MIN_DF |
| static String | VECTOR_COUNT |
| static String | WORDCOUNT_OUTPUT_FOLDER |
| Modifier and Type | Method and Description |
|---|---|
| static Pair<Long[],List<org.apache.hadoop.fs.Path>> | calculateDF(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes) Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format. |
| static void | processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures, int minDf, long maxDF, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers) Creates Term Frequency-Inverse Document Frequency (Tf-Idf) vectors from the input set of vectors in SequenceFile format. |
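The weighting these jobs apply can be illustrated with the textbook tf-idf formula. This is a minimal self-contained sketch, assuming the plain log-idf form; the weight Mahout actually applies (and the effect of options such as normPower and logNormalize) may differ in detail:

```java
public class TfIdfSketch {
    // Plain tf-idf weight: tf * ln(numDocs / df).
    // Assumption: textbook formula for illustration only; Mahout's
    // weighting class may add damping or normalization on top of this.
    static double tfidf(double tf, long df, long numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document, present in 10 of 1000 docs:
        System.out.printf("%.4f%n", tfidf(3.0, 10, 1000)); // prints 13.8155
    }
}
```

A term that appears in every document gets weight 0, which is why high-frequency features can also be pruned up front via the maxDF parameter.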
public static final String VECTOR_COUNT
public static final String FEATURE_COUNT
public static final String MIN_DF
public static final String MAX_DF
public static final String FREQUENCY_FILE
public static final String WORDCOUNT_OUTPUT_FOLDER
public static void processTfIdf(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures,
int minDf,
long maxDF,
float normPower,
boolean logNormalize,
boolean sequentialAccessOutput,
boolean namedVector,
int numReducers)
throws IOException,
InterruptedException,
ClassNotFoundException
Creates Term Frequency-Inverse Document Frequency (Tf-Idf) vectors from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduce jobs. Before using this method, calculateDF should be called.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where RandomAccessSparseVectors of the documents are generated
datasetFeatures - document frequency information calculated by calculateDF
minDf - the minimum document frequency. Default 1
maxDF - the max percentage of vectors for the DF. Can be used to remove really high frequency features. Expressed as an integer between 0 and 100. Default 99
numReducers - the number of reducers to spawn. This also affects the possible parallelism, since each reducer will typically produce a single output file containing tf-idf vectors for a subset of the documents in the corpus.

Throws:
IOException
InterruptedException
ClassNotFoundException

public static Pair<Long[],List<org.apache.hadoop.fs.Path>> calculateDF(org.apache.hadoop.fs.Path input,
                                                                       org.apache.hadoop.fs.Path output,
                                                                       org.apache.hadoop.conf.Configuration baseConf,
                                                                       int chunkSizeInMegabytes)
                                                                throws IOException,
                                                                       InterruptedException,
                                                                       ClassNotFoundException
Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduce jobs.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where document frequencies will be stored
chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended you calculate this based on the number of cores and the free memory available per node. Say you have 2 cores and around 1 GB of extra memory to spare; we recommend a split size of around 400-500 MB so that two simultaneous reducers can create partial vectors without thrashing the system due to increased swapping.

Throws:
IOException
InterruptedException
ClassNotFoundException

Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.
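The chunk-sizing advice for calculateDF can be reduced to a rough rule of thumb. This sketch is illustrative only; the helper name and heuristic are not part of the Mahout API:

```java
public class ChunkSizeHint {
    // Heuristic from the guidance above: divide the node's spare memory
    // among the reducers that may run concurrently (roughly one per core).
    static int chunkSizeInMegabytes(int spareMemoryMb, int coresPerNode) {
        return spareMemoryMb / coresPerNode;
    }

    public static void main(String[] args) {
        // 2 cores and ~1 GB spare memory -> ~512 MB per chunk,
        // consistent with the 400-500 MB suggestion above.
        System.out.println(chunkSizeInMegabytes(1024, 2)); // prints 512
    }
}
```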