| Class | Description |
|---|---|
| ClueCollectionPathConstants | Class that provides convenience methods for processing portions of the ClueWeb09 collection with Hadoop. |
| ClueWarcDocnoMapping | Object that maps between WARC-TREC-IDs (String identifiers) and docnos (sequentially numbered ints). |
| ClueWarcDocnoMappingBuilder | |
| ClueWarcForwardIndex | |
| ClueWarcForwardIndexBuilder | Tool for building a document forward index for the ClueWeb09 collection. |
| ClueWarcInputFormat | |
| ClueWarcInputFormat.ClueWarcRecordReader | |
| ClueWarcRecord | |
| CountClueWarcRecords | Simple demo program to count the number of records in the ClueWeb09 collection, from either the original source WARC files or repacked SequenceFiles. |
| RepackClueWarcRecords | Program to uncompress the ClueWeb09 collection from the original distribution WARC files and repack as SequenceFiles. |
| ScanBlockCompressedSequenceFile | |

Provides classes for working with the ClueWeb09 collection. The dataset consists of one billion web pages (5 TB compressed, 25 TB uncompressed) in ten languages, collected in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies.
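
The idea behind a docno mapping such as the one ClueWarcDocnoMapping is described as providing (String WARC-TREC-IDs on one side, sequentially numbered ints on the other) can be illustrated with a small, self-contained sketch. The class name `DocnoMappingSketch`, its methods, the 1-based numbering, and the sorted-array strategy are all assumptions made for illustration. This is not the actual implementation or API of the classes in this package, which may compute docnos differently.

```java
import java.util.Arrays;
import java.util.Collection;

/**
 * Hypothetical illustration of a docid <-> docno mapping.
 * Not the ClueWarcDocnoMapping implementation; names and numbering are assumptions.
 */
public class DocnoMappingSketch {
    // Identifiers in sorted order; the docno of ids[i] is i + 1 (assumed 1-based).
    private final String[] ids;

    /** Builds the mapping from a collection of unique identifiers. */
    public DocnoMappingSketch(Collection<String> docids) {
        ids = docids.toArray(new String[0]);
        Arrays.sort(ids);
    }

    /** Returns the 1-based docno for an identifier, or -1 if it is unknown. */
    public int getDocno(String docid) {
        int pos = Arrays.binarySearch(ids, docid);
        return pos >= 0 ? pos + 1 : -1;
    }

    /** Returns the identifier for a docno, or null if the docno is out of range. */
    public String getDocid(int docno) {
        return (docno >= 1 && docno <= ids.length) ? ids[docno - 1] : null;
    }

    public static void main(String[] args) {
        DocnoMappingSketch mapping = new DocnoMappingSketch(Arrays.asList(
            "clueweb09-en0000-00-00002",
            "clueweb09-en0000-00-00000",
            "clueweb09-en0000-00-00001"));
        System.out.println(mapping.getDocno("clueweb09-en0000-00-00001")); // prints 2
        System.out.println(mapping.getDocid(3)); // prints clueweb09-en0000-00-00002
    }
}
```

Keeping the identifiers in a sorted array makes the String-to-int lookup a binary search and the int-to-String lookup a plain array access, which is one common way to keep such a mapping compact in memory.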
Copyright © 2015. All rights reserved.