

wiki-articles - extract articles from a Wikipedia XML database.


wiki-articles [-d directory]


The wiki-articles utility reads a Wikipedia XML database of articles from standard input and extracts MediaWiki markup for each article. The output is printed to standard output, or, if the -d parameter is specified, to files in a directory.

wiki-articles can work with very large databases (tens of gigabytes) without running out of memory.
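To illustrate how a dump of this size can be processed in roughly constant memory, here is a minimal Python sketch of incremental XML parsing. It is not the wiki-articles implementation. The function names iter_articles and local_name are hypothetical, and the sketch assumes the usual MediaWiki export layout, in which each page element holds a title and a revision containing a text element.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, markup) pairs one page at a time (illustrative sketch)."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            root.clear()  # discard finished pages so memory use stays flat


# Tiny in-memory stand-in for a real dump (assumed layout, not real data).
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>'''A''' text</text></revision></page>"
    b"<page><title>B</title><revision><text>[[B]] text</text></revision></page>"
    b"</mediawiki>"
)
for title, markup in iter_articles(sample):
    print(title, len(markup))
```

For a real dump, the same generator could be given sys.stdin.buffer instead of the in-memory sample.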

If no output directory is specified, articles are printed to standard output, delimited by form feed characters (the C character '\f').
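A consumer of the standard output stream can recover individual articles by splitting on the form feed character. The Python sketch below shows one way to do this without loading the whole stream. The function name split_articles and the chunk_size parameter are hypothetical, and the sketch assumes that article markup never contains a form feed itself. It accepts a delimiter either between articles or after each one.

```python
import io


def split_articles(stream, chunk_size=1 << 16):
    """Yield one article's markup at a time from form-feed-delimited text."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last form feed is complete; keep the tail.
        *complete, buffer = buffer.split("\f")
        yield from complete
    if buffer.strip():
        yield buffer  # last article when no trailing delimiter follows it


# In-memory stand-in for the output of wiki-articles (made-up content).
sample = io.StringIO("first article\n\fsecond article\n")
for number, article in enumerate(split_articles(sample), start=1):
    print(number, article.strip())
```

In a pipeline, sys.stdin could be passed in place of the in-memory sample.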


-d the output directory
If set, articles will be saved in the directory with each article in a separate file. The files will be named after the articles.



cat ../wikipedia/data/enwiki-20110620-pages-articles.xml | wiki-articles
Output (truncated):

{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Good article}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]]


Autocorpus was written by Maciej Pacula (

The project website is

See Also

autocorpus(7), ngrams(1), ngrams(5), ngrams-freq-filter(1), ngrams-sort(1), sentences(1), tokenize(1), wiki-textify(1)
