- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class CountWikipediaPages
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool
Tool for counting the number of pages in a particular Wikipedia XML dump file. This program keeps
track of the total number of pages, redirect pages, disambiguation pages, empty pages, actual
articles (including stubs), stubs, and non-articles ("File:", "Category:", "Wikipedia:", etc.).
It also provides a skeleton for MapReduce programs that process the collection. Specify the
path to the Wikipedia XML dump file with the -input flag.
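
The counting categories described above can be illustrated with a minimal, Hadoop-free sketch. Everything in it is an assumption made for illustration: the class name PageCountSketch, the PageType enum, and especially the text heuristics (a "#REDIRECT" prefix, a "{{disambig" template, a "stub}}" template suffix, a fixed list of namespace prefixes) are hypothetical and may differ from the rules this tool actually applies. It only shows how one page might map to the counters, with stubs counted both as articles and as stubs.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the per-page counting logic; not the tool's actual code. */
public class PageCountSketch {

    /** Counter names mirroring the categories listed in the class description. */
    public enum PageType { TOTAL, REDIRECT, DISAMBIGUATION, EMPTY, ARTICLE, STUB, NON_ARTICLE }

    private final Map<PageType, Long> counts = new EnumMap<>(PageType.class);

    private void increment(PageType type) {
        counts.merge(type, 1L, Long::sum);
    }

    /** Returns the current value of a counter, or 0 if it was never incremented. */
    public long get(PageType type) {
        return counts.getOrDefault(type, 0L);
    }

    /**
     * Classifies one page and updates the counters.
     * All heuristics below are assumptions chosen for illustration only.
     */
    public void count(String title, String wikitext) {
        increment(PageType.TOTAL);
        String text = wikitext == null ? "" : wikitext.trim();
        String lower = text.toLowerCase(Locale.ROOT);

        if (lower.startsWith("#redirect")) {
            increment(PageType.REDIRECT);
        } else if (lower.contains("{{disambig")) {
            increment(PageType.DISAMBIGUATION);
        } else if (text.isEmpty()) {
            increment(PageType.EMPTY);
        } else if (title.matches("(?:File|Category|Wikipedia|Template|Help|Portal):.*")) {
            // Assumed namespace prefixes; the description names only the first three.
            increment(PageType.NON_ARTICLE);
        } else {
            // Articles include stubs, so a stub increments both counters.
            increment(PageType.ARTICLE);
            if (lower.contains("stub}}")) {
                increment(PageType.STUB);
            }
        }
    }

    public static void main(String[] args) {
        PageCountSketch counter = new PageCountSketch();
        counter.count("Foo", "#REDIRECT [[Bar]]");
        counter.count("Category:Physics", "Pages about physics.");
        counter.count("Tiny", "Short text. {{physics-stub}}");
        for (PageType type : PageType.values()) {
            System.out.println(type + "\t" + counter.get(type));
        }
    }
}
```

In a MapReduce setting, each branch would typically increment a job counter inside the mapper instead of updating an in-memory map; the map is used here only to keep the sketch self-contained.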
- Author:
- Jimmy Lin, Peter Exner