public class CountClueWarcRecords
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
Simple demo program to count the number of records in the ClueWeb09 collection, from either the original source WARC files or repacked SequenceFiles. Sample invocations:
hadoop jar dist/cloud9-X.X.X.jar edu.umd.cloud9.collection.clue.CountClueWarcRecords \
  -libjars lib/guava-X.X.X.jar \
  -original -path /shared/collections/ClueWeb09/collection.raw/ -segment 1 \
  -docnoMapping /shared/collections/ClueWeb09/docno-mapping.dat -countOutput records.txt

hadoop jar dist/cloud9-X.X.X.jar edu.umd.cloud9.collection.clue.CountClueWarcRecords \
  -libjars lib/guava-X.X.X.jar \
  -repacked -path /shared/collections/ClueWeb09/collection.compressed.block/en.01 \
  -docnoMapping /shared/collections/ClueWeb09/docno-mapping.dat -countOutput records.txt
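
The two invocations differ only in the mode flag and in whether a segment is given. As a rough illustration, the sketch below assembles the tool arguments for either mode. `ClueCountArgs` and `build` are hypothetical names, not part of Cloud9, and the flag spellings are copied from the sample invocations; this page does not list the values of the `*_OPTION` constants, so treating them as equal to these strings is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cloud9: assembles the tool arguments
// that appear in the sample invocations.
class ClueCountArgs {
    // Flag spellings are copied from the sample invocations. The values of the
    // *_OPTION constants are not listed on this page, so matching them to
    // these strings is an assumption.
    static List<String> build(boolean original, String path, Integer segment,
                              String docnoMapping, String countOutput) {
        List<String> args = new ArrayList<>();
        args.add(original ? "-original" : "-repacked");
        args.add("-path");
        args.add(path);
        if (segment != null) {
            // In the samples, -segment appears only together with -original.
            args.add("-segment");
            args.add(String.valueOf(segment));
        }
        args.add("-docnoMapping");
        args.add(docnoMapping);
        args.add("-countOutput");
        args.add(countOutput);
        return args;
    }

    public static void main(String[] unused) {
        String mapping = "/shared/collections/ClueWeb09/docno-mapping.dat";
        // Arguments for the original WARC files, segment 1.
        System.out.println(String.join(" ", build(true,
                "/shared/collections/ClueWeb09/collection.raw/", 1, mapping, "records.txt")));
        // Arguments for the repacked SequenceFiles; no segment is passed.
        System.out.println(String.join(" ", build(false,
                "/shared/collections/ClueWeb09/collection.compressed.block/en.01",
                null, mapping, "records.txt")));
    }
}
```

The generic `-libjars` option is left out on purpose: it is handled by Hadoop before the tool sees its own arguments.
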
| Modifier and Type | Field and Description |
|---|---|
| static String | COUNT_OPTION |
| static String | MAPPING_OPTION |
| static String | ORIGINAL_OPTION |
| static String | PATH_OPTION |
| static String | REPACKED_OPTION |
| static String | SEGMENT_OPTION |
| Constructor and Description |
|---|
| CountClueWarcRecords() |
| Modifier and Type | Method and Description |
|---|---|
| static void | main(String[] args): Dispatches command-line arguments to the tool via the ToolRunner. |
| int | run(String[] args): Runs this tool. |
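
The relationship between `main` and `run` can be pictured with a minimal sketch that needs no Hadoop dependency. `MiniTool`, `MiniToolRunner`, and `CountDemo` are hypothetical stand-ins written for this illustration; they are not Hadoop or Cloud9 classes, and the exit-code convention shown is an assumption rather than documented behavior of `CountClueWarcRecords`.

```java
// Stand-in for org.apache.hadoop.util.Tool, reduced to the one method this
// page documents. The Hadoop method also declares "throws Exception", which
// is dropped here to keep the sketch short.
interface MiniTool {
    int run(String[] args);
}

// Stand-in for ToolRunner: hands the arguments to the tool and returns its
// exit code. The real ToolRunner also parses generic options such as -libjars
// before calling run; that step is omitted here.
class MiniToolRunner {
    static int run(MiniTool tool, String[] args) {
        return tool.run(args);
    }
}

class CountDemo implements MiniTool {
    @Override
    public int run(String[] args) {
        // Assumed convention: 0 on success, non-zero when arguments are missing.
        if (args.length == 0) {
            System.err.println("usage: CountDemo [options]");
            return -1;
        }
        System.out.println("would count records with " + args.length + " arguments");
        return 0;
    }

    public static void main(String[] args) {
        // With Hadoop on the classpath the equivalent call would be
        // ToolRunner.run(new CountClueWarcRecords(), args).
        int exitCode = MiniToolRunner.run(new CountDemo(), args);
        // A real launcher would typically pass this value to System.exit.
        System.out.println("exit code: " + exitCode);
    }
}
```

Keeping `main` this thin means the same `run` method can be driven from the command line or called directly from other code.
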
public static final String ORIGINAL_OPTION
public static final String REPACKED_OPTION
public static final String PATH_OPTION
public static final String MAPPING_OPTION
public static final String SEGMENT_OPTION
public static final String COUNT_OPTION
Copyright © 2015. All rights reserved.