public final class DocumentProcessor extends Object
Converts a set of input documents in SequenceFile format into tokenized StringTuples. The
SequenceFile input should have a Text key
containing the unique document identifier and a
Text value containing the whole document. The document should be stored in UTF-8 encoding, which is
recognizable by Hadoop. It uses the given Analyzer to process the document into
tokens.

| Modifier and Type | Field and Description |
|---|---|
| static String | ANALYZER_CLASS |
| static String | TOKENIZED_DOCUMENT_OUTPUT_FOLDER |
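A minimal sketch of producing input in the expected layout (a Text document-id key and a Text UTF-8 document body value); the file path and document contents here are illustrative, not part of this API:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Writes one document per record into a SequenceFile&lt;Text, Text&gt;. */
public class WriteDocsSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path in = new Path("docs-seq/part-00000");  // hypothetical input path

    // Key: unique document identifier; value: whole document in UTF-8,
    // matching the input contract described above.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(in),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      writer.append(new Text("doc-1"), new Text("The quick brown fox."));
      writer.append(new Text("doc-2"), new Text("Jumps over the lazy dog."));
    }
  }
}
```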
| Modifier and Type | Method and Description |
|---|---|
| static void | tokenizeDocuments(org.apache.hadoop.fs.Path input, Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf)<br>Convert the input documents into token arrays using StringTuple. The input documents have to be in SequenceFile format. |
public static final String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
public static final String ANALYZER_CLASS
public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
                                     Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                                     org.apache.hadoop.fs.Path output,
                                     org.apache.hadoop.conf.Configuration baseConf)
                              throws IOException,
                                     InterruptedException,
                                     ClassNotFoundException

Convert the input documents into token arrays using StringTuple. The input documents have to be
in SequenceFile format.

Parameters:
input - input directory of the documents in SequenceFile format
analyzerClass - the Lucene Analyzer for tokenizing the UTF-8 text
output - output directory where the StringTuple token array of each document has to be created
baseConf - the base Hadoop Configuration for the job

Throws:
IOException
InterruptedException
ClassNotFoundException

Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.
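A minimal driver sketch under stated assumptions: the HDFS paths are placeholders, StandardAnalyzer is one illustrative Analyzer choice (Mahout requires an analyzer with a no-argument constructor), and, following Mahout's own pipeline convention, the caller composes the output directory from TOKENIZED_DOCUMENT_OUTPUT_FOLDER:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

/** Runs the tokenization MapReduce job over a directory of SequenceFile documents. */
public class TokenizeDriver {
  public static void main(String[] args)
      throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Path input = new Path("hdfs:///data/docs-seq");    // Text doc-id key, Text UTF-8 body value
    Path outputBase = new Path("hdfs:///data/out");    // hypothetical pipeline base directory

    // Conventionally the tokenized output lives in a well-known subfolder
    // of the pipeline's base output directory.
    Path tokenized = new Path(outputBase, DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);

    // Launches the job; each document becomes a StringTuple of tokens.
    DocumentProcessor.tokenizeDocuments(input, StandardAnalyzer.class, tokenized, conf);
  }
}
```

Downstream vectorization steps (for example, term-frequency vector creation) typically read the StringTuple SequenceFiles from this tokenized directory.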