Table of Contents

Name

wiki-articles - extract articles from a Wikipedia XML database.

Synopsis

wiki-articles [-d directory]

Description

The wiki-articles utility reads a Wikipedia XML database of articles from standard input and extracts the MediaWiki markup of each article. The output is written to standard output or, if the -d option is given, to files in the specified directory.

wiki-articles can work with very large databases (tens of gigabytes) without running out of memory.
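To illustrate how a dump of this size can be processed in roughly constant memory, here is a minimal Python sketch of streaming XML extraction. It is not the implementation used by wiki-articles, which this page does not describe. The function names iter_pages and local_name are hypothetical, and the sketch assumes the standard MediaWiki export layout (a page element containing title and revision/text) with one revision per page.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, markup) pairs from a MediaWiki XML export, one page at a time.

    Illustrative sketch only; assumes one <revision> per <page>.
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Drop everything parsed so far so the tree never grows.
            root.clear()
```

Because each page is discarded as soon as it has been yielded, memory use depends on the largest single article rather than on the size of the whole dump.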

If no output directory is specified, articles are printed to standard output and delimited by form feeds (C character '\f').
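A downstream program can recover the individual articles by splitting the stream on the form feed character. The following Python sketch shows one way to do that incrementally; iter_articles and chunk_size are hypothetical names introduced here. Since this page does not say whether a form feed also follows the last article, the sketch simply skips empty pieces.

```python
import io


def iter_articles(stream, chunk_size=65536):
    """Yield articles from a text stream in which articles are separated by form feeds."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last form feed is complete; keep the tail for later.
        *complete, buffer = buffer.split("\f")
        for article in complete:
            if article.strip():
                yield article
    # Whatever remains after the final form feed (if anything) is the last article.
    if buffer.strip():
        yield buffer
```

Passing sys.stdin to iter_articles in a script placed at the end of a wiki-articles pipeline yields one article at a time without loading the whole output.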

Options

-d the output directory
If set, articles will be saved in the directory with each article in a separate file. The files will be named after the articles.

Examples

Command:

cat ../wikipedia/data/enwiki-20110620-pages-articles.xml | wiki-articles

Output (truncated):

{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{Good article}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]]

Author

Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com).

The project website is http://mpacula.com/autocorpus

See Also

autocorpus(7), ngrams(1), ngrams(5), ngrams-freq-filter(1), ngrams-sort(1), sentences(1), tokenize(1), wiki-textify(1)

