public class RepackClueWarcRecords
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
Program to uncompress the ClueWeb09 collection from the original distribution WARC files and
repack as SequenceFiles.
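The program takes five command-line arguments, in order: the base path of the raw ClueWeb09 distribution, the output path, the segment number, the docno mapping data file, and the compression type (block in the sample below).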
Here's a sample invocation:
hadoop jar dist/cloud9-X.X.X.jar edu.umd.cloud9.collection.clue.RepackClueWarcRecords \
  /shared/collections/ClueWeb09/collection.raw \
  /shared/collections/ClueWeb09/collection.compressed.block/en.01 1 \
  /shared/collections/ClueWeb09/docno-mapping.dat block
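The positional arguments in the sample invocation above can be checked before a job is submitted. The following is a minimal sketch, not the tool's actual parsing code: the class `RepackArgs`, its field names, and the usage string are hypothetical, and the accepted compression names (none, record, block) are an assumption that mirrors Hadoop's `SequenceFile.CompressionType`; only block appears in the sample.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical holder for the five positional arguments; not part of Cloud9. */
final class RepackArgs {
    // Assumed compression names, mirroring Hadoop's SequenceFile.CompressionType.
    private static final List<String> COMPRESSION_TYPES =
            Arrays.asList("none", "record", "block");

    final String inputPath;         // base path of the raw WARC distribution
    final String outputPath;        // where the SequenceFiles are written
    final int segment;              // segment number, e.g. 1 for en.01
    final String docnoMappingFile;  // docno mapping data file
    final String compression;       // lower-cased compression type

    private RepackArgs(String inputPath, String outputPath, int segment,
                       String docnoMappingFile, String compression) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.segment = segment;
        this.docnoMappingFile = docnoMappingFile;
        this.compression = compression;
    }

    /** Parses the five arguments in order; throws IllegalArgumentException on bad input. */
    static RepackArgs parse(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(
                    "usage: [input-path] [output-path] [segment-num] "
                            + "[docno-mapping-file] [compression-type]");
        }
        int segment;
        try {
            segment = Integer.parseInt(args[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "segment number must be an integer: " + args[2], e);
        }
        String compression = args[4].toLowerCase(Locale.ROOT);
        if (!COMPRESSION_TYPES.contains(compression)) {
            throw new IllegalArgumentException("unknown compression type: " + args[4]);
        }
        return new RepackArgs(args[0], args[1], segment, args[3], compression);
    }

    public static void main(String[] args) {
        // Same argument values as the sample invocation.
        RepackArgs parsed = parse(new String[] {
            "/shared/collections/ClueWeb09/collection.raw",
            "/shared/collections/ClueWeb09/collection.compressed.block/en.01",
            "1",
            "/shared/collections/ClueWeb09/docno-mapping.dat",
            "block"
        });
        System.out.println(parsed.outputPath + " segment=" + parsed.segment
                + " compression=" + parsed.compression);
    }
}
```

Rejecting a malformed argument list up front avoids launching a cluster job that would only fail after it starts reading the collection.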
| Constructor and Description |
|---|
| RepackClueWarcRecords() Creates an instance of this tool. |
| Modifier and Type | Method and Description |
|---|---|
| static void | main(String[] args) Dispatches command-line arguments to the tool via the ToolRunner. |
| int | run(String[] args) Runs this tool. |
Copyright © 2015. All rights reserved.