This document describes how to accomplish common tasks using the Autocorpus tools.
wiki-articles reads Wikipedia XML databases (http://dumps.wikimedia.org/enwiki/) and extracts article markup. The extracted markup is written to standard output, with articles delimited by form feed characters (\f).
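Because the articles are separated by form feeds, a downstream tool can treat each article as a single record. The following is a minimal sketch using awk with "\f" as the record separator. The name count_articles is made up for illustration and is not part of Autocorpus, and the demo input is hand-made rather than real wiki-articles output.

```shell
# Hypothetical helper, not part of Autocorpus: count form-feed-delimited
# records on standard input, skipping records that hold only whitespace.
count_articles() {
    awk 'BEGIN { RS = "\f" } /[^ \t\r\n]/ { n++ } END { print n + 0 }'
}

# Demo on a tiny hand-made stream of three "articles".
printf 'First article\n\fSecond article\n\fThird article\n' | count_articles
```

Skipping whitespace-only records keeps the count stable whether or not the stream ends with a trailing form feed.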
wiki-textify reads the output produced by wiki-articles and removes MediaWiki markup. The result is a plaintext version of Wikipedia articles.
The two programs are intended to be used in a pipeline. For example,
if you have the June 2011 Wikipedia database you can convert it to plaintext
as follows:
pv enwiki-20110620-pages-articles.xml | wiki-articles | wiki-textify -h > wikipedia-plaintext.txt
The output will then be saved in the file wikipedia-plaintext.txt (if pv is not available on your system, you can use cat instead).
The optional flag -h suppresses section headings in the output.
sentences splits its input into separate sentences, one per output line. It also splits paragraphs by inserting an extra linebreak between the last and first sentences of consecutive paragraphs.
tokenize normalizes words within sentences by downcasing all characters and making sure words are separated by exactly one space. By default it also ignores text within parentheses.
For example, to clean up the text file wikipedia-plaintext.txt and save
it as wikipedia-clean.txt, pipe it through both sentences and tokenize:
pv wikipedia-plaintext.txt | sentences | tokenize > wikipedia-clean.txt
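To get a feel for the kind of normalization described for tokenize, here is a rough approximation built from tr and sed. It is only a sketch, not the tokenize implementation. normalize_line is a hypothetical name, the tr step only downcases ASCII letters in many implementations, and the real tool may treat punctuation or nested parentheses differently.

```shell
# Hypothetical approximation, not the real tokenize: downcase, drop
# parenthesized text, and collapse runs of whitespace to a single space.
normalize_line() {
    tr '[:upper:]' '[:lower:]' |
        sed -e 's/([^)]*)//g' \
            -e 's/[[:space:]][[:space:]]*/ /g' \
            -e 's/^ //' -e 's/ $//'
}

# Demo on one hand-made sentence.
printf 'The  Quick (very quick) Brown   Fox\n' | normalize_line
```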
To count ngrams in a text file, clean it up first (see the
section above) and then pipe it to the ngrams utility. For example, the
command below will count bigrams in the file wikipedia-clean.txt and save
the result in bigrams.txt:
pv wikipedia-clean.txt | ngrams -n 2 > bigrams.txt
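To make the counting step concrete, here is a rough awk stand-in for bigram counting on cleaned, one-sentence-per-line text. It is not the ngrams implementation. count_bigrams is a hypothetical name, and both the tab-separated "count, then bigram" output and the decision not to let bigrams span lines are assumptions made for this sketch.

```shell
# Hypothetical stand-in for illustration only; the real ngrams utility may
# tokenize its input and format its output differently.
count_bigrams() {
    awk '{ for (i = 1; i < NF; i++) count[$i " " $(i + 1)]++ }
         END { for (b in count) print count[b] "\t" b }'
}

# Demo on two hand-made sentences; sort only makes the output deterministic.
printf 'the cat sat\nthe cat ran\n' | count_bigrams | LC_ALL=C sort
```

awk visits the keys of an array in an unspecified order, which is why the demo sorts the result before printing it.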
The ngrams utility emits ngrams in no particular order.
To sort them by decreasing counts (most frequent ngrams first), pipe the
output of ngrams to ngrams-sort:
pv wikipedia-clean.txt | ngrams -n 2 | ngrams-sort > bigrams.txt
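For comparison, the same kind of ordering can be sketched with the standard sort utility. This assumes each line holds a count, a tab, and then the ngram, which is an assumption about the format, so check the actual ngrams output before relying on it. sort_by_count is a hypothetical name, not an Autocorpus tool.

```shell
# Hypothetical helper: order "count<TAB>ngram" lines by decreasing count,
# breaking ties alphabetically by ngram. The line format is an assumption.
sort_by_count() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2
}

# Demo on hand-made counts.
printf '1\tcat sat\n2\tthe cat\n1\tcat ran\n' | sort_by_count
```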
To remove infrequent ngrams, use ngrams-freq-filter. For example, to filter out
ngrams with counts below 5 in the file bigrams.txt, run:
pv bigrams.txt | ngrams-freq-filter -t 5
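The thresholding idea can be sketched with awk as well. As before, this assumes lines of the form count, tab, ngram, which may not match the real file format. filter_min_count is a hypothetical name and not a replacement for ngrams-freq-filter.

```shell
# Hypothetical helper: keep only "count<TAB>ngram" lines whose count is at
# least the threshold given as the first argument. The format is an assumption.
filter_min_count() {
    awk -F '\t' -v min="$1" '$1 + 0 >= min + 0'
}

# Demo: keep ngrams seen at least 5 times.
printf '7\tof the\n4\tcat sat\n5\tin a\n' | filter_min_count 5
```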
If you encounter a Unicode issue with any of the Autocorpus utilities, an easy workaround is to use an ASCII converter, e.g. uni2ascii.
Pipe the input data through uni2ascii before it reaches the Autocorpus tools,
and then convert the result back to Unicode with ascii2uni
when you’re done. For example:
pv enwiki-20110620-pages-articles.xml | uni2ascii | wiki-articles | wiki-textify -h | ascii2uni > wikipedia-plaintext.txt
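Before adding the extra conversion steps, it can help to check whether the input contains any non-ASCII bytes at all. The sketch below counts them with tr and wc. count_non_ascii_bytes is a hypothetical name and not part of Autocorpus or uni2ascii.

```shell
# Hypothetical helper: delete every byte in the ASCII range (octal 000-177)
# and count what is left, i.e. the number of non-ASCII bytes on stdin.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Demo: "café" encoded as UTF-8, where the accented letter takes two bytes.
printf 'caf\303\251\n' | count_non_ascii_bytes
```

A result of 0 suggests the input is plain ASCII, in which case the round trip through uni2ascii and ascii2uni should not be needed.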
The project website is http://mpacula.com/autocorpus