public class ExtractHTMLFieldCollection extends PowerTool
Tool for generating 'per-field' collections from HTML documents. The output of this
tool is a new collection, in TREC format (in the form of a SequenceFile
| Modifier and Type | Class and Description |
|---|---|
| static class | ExtractHTMLFieldCollection.MyMapper |
| Modifier and Type | Field and Description |
|---|---|
| static String[] | RequiredParameters |
| Constructor and Description |
|---|
| ExtractHTMLFieldCollection(org.apache.hadoop.conf.Configuration conf) |
| Modifier and Type | Method and Description |
|---|---|
| String[] | getRequiredParameters() |
| static void | main(String[] args) |
| static void | recursivelyAddInputPaths(org.apache.hadoop.mapreduce.Job job, String path) |
| int | runTool() |
public static final String[] RequiredParameters
public ExtractHTMLFieldCollection(org.apache.hadoop.conf.Configuration conf)
public String[] getRequiredParameters()
getRequiredParameters in class PowerTool
public int runTool()
throws Exception
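
The pairing of `getRequiredParameters()` with `runTool()` suggests a pattern in which a tool declares the configuration keys it needs and a driver checks them before running. The following is only a minimal sketch of that pattern under stated assumptions: `SketchTool`, `SketchFieldExtractor`, `missingParameters()`, `run()`, and the two parameter names are all hypothetical, and a plain `Map` stands in for Hadoop's `Configuration`. It does not reproduce the actual `PowerTool` implementation, which this page does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for PowerTool; the real base class is not shown on this page.
abstract class SketchTool {
    // A plain Map stands in for org.apache.hadoop.conf.Configuration (assumption).
    protected final Map<String, String> conf;

    SketchTool(Map<String, String> conf) {
        this.conf = conf;
    }

    abstract String[] getRequiredParameters();

    abstract int runTool() throws Exception;

    // Returns the required parameter names that have no value in the configuration.
    List<String> missingParameters() {
        List<String> missing = new ArrayList<>();
        for (String name : getRequiredParameters()) {
            if (conf.get(name) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Validates the configuration, then delegates to runTool().
    // Returns -1 without running the tool when a required parameter is missing.
    int run() throws Exception {
        List<String> missing = missingParameters();
        if (!missing.isEmpty()) {
            System.err.println("Missing required parameters: " + missing);
            return -1;
        }
        return runTool();
    }
}

// Hypothetical tool; the parameter names below are invented for illustration.
class SketchFieldExtractor extends SketchTool {
    static final String[] RequiredParameters = {
        "sketch.input.path", "sketch.output.path"
    };

    SketchFieldExtractor(Map<String, String> conf) {
        super(conf);
    }

    @Override
    String[] getRequiredParameters() {
        return RequiredParameters;
    }

    @Override
    int runTool() {
        // A real tool would configure and submit a MapReduce job here.
        return 0;
    }
}
```

Keeping the parameter list in a static array, as the `RequiredParameters` field does, lets callers inspect what a tool needs without constructing it.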
public static void recursivelyAddInputPaths(org.apache.hadoop.mapreduce.Job job,
String path)
throws IOException
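
The name `recursivelyAddInputPaths` suggests a depth-first walk that registers every file under a directory as job input. The sketch below illustrates that idea using only the JDK so that it runs without Hadoop: a `List<Path>` stands in for the `Job`, and `java.nio.file` stands in for the Hadoop file system. The class name `RecursiveInputPathsSketch` and this signature are assumptions for illustration, not the tool's actual code. In a Hadoop setting, the analogous calls would be `FileSystem.listStatus` to enumerate children and `FileInputFormat.addInputPath` to register each file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical JDK-only analogue; a List<Path> stands in for the Hadoop Job.
class RecursiveInputPathsSketch {

    // Adds 'path' to 'inputs' if it is not a directory; otherwise descends
    // into it and applies the same rule to each child.
    static void recursivelyAddInputPaths(List<Path> inputs, String path) throws IOException {
        Path root = Paths.get(path);
        if (!Files.isDirectory(root)) {
            inputs.add(root);
            return;
        }
        // try-with-resources closes the directory handle after iteration.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(root)) {
            for (Path child : children) {
                recursivelyAddInputPaths(inputs, child.toString());
            }
        }
    }
}
```

The order in which files are collected depends on the directory listing and should not be relied upon.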

Copyright © 2015. All rights reserved.