ngrams(5) manual page

Introduction

Autocorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. For example, it provides the full set of tools to translate the entire English Wikipedia from a 30+GB XML file to a clean n-gram language model, all in a matter of a few hours.

This document describes how to accomplish common tasks using the Autocorpus tools.

Plaintext Wikipedia

Autocorpus includes two tools that enable extraction and conversion to plaintext of Wikipedia articles: wiki-articles and wiki-textify.

wiki-articles reads Wikipedia XML databases (http://dumps.wikimedia.org/enwiki/) and extracts article markup. The extracted markup is written to standard output, with articles delimited by page feeds \f.

wiki-textify reads the output produced by wiki-articles and removes MediaWiki markup. The result is a plaintext version of Wikipedia articles.

The two programs are intended to be used in a pipeline. For example, if you have the June 2011 Wikipedia database you can convert it to plaintext as follows:

pv enwiki-20110620-pages-articles.xml | wiki-articles | wiki-textify -h > \
wikipedia-plaintext.txt

The output will then be saved in the file wikipedia-plaintext.txt (if pv is not available on your system, you can use cat instead).

The optional flag -h suppresses section headings in the output.

Cleaning Up Text

Autocorpus includes utilities for cleaning up text: sentences and tokenize.

sentences splits its input into separate sentences, one per output line. It also splits paragraphs by inserting an extra linebreak between the last and first sentences of consecutive paragraphs.

tokenize normalizes words within sentences by downcasing all characters and making sure words are separated by exactly one space. By default it also ignores text within parentheses.

For example, to clean up the text file wikipedia-plaintext.txt and save it as wikipedia-clean.txt, pipe it through both sentences and tokenize:

pv wikipedia-plaintext.txt | sentences | tokenize > \
wikipedia-clean.txt

Counting Ngrams

To count ngrams in a text file, clean it up first (see section above) and then pipe it to the ngrams utility. For example, the command below will count bigrams in the file wikipedia-clean.txt and save the result in bigrams.txt:

pv wikipedia-clean.txt | ngrams -n 2 > bigrams.txt

The ngrams produced by the ngrams utility will appear in a random order. To sort them by decreasing counts (most frequent ngrams first), pipe the output of ngrams to ngrams-sort:

pv wikipedia-clean.txt | ngrams -n 2 | ngrams-sort > bigrams.txt

Filtering Ngrams

ngrams-freq-filter lets you filter out ngrams with counts below a specified threshold. This will allow you, among other things, to remove some noise from your language model.

For example, to filter out ngrams with counts below 5 in the file bigrams.txt, run:

pv bigrams.txt | ngrams-freq-filter -t 5

Limitations: Unicode

Currently, autocorpus only has a limited support for unicode (UTF-8). While unicode input and output should mostly work, some characters might not be decoded correctly.

If you encounter a Unicode issue with any of the autocorpus utilities, an easy workaround is to use an ASCII converter, e.g. uni2ascii.

Before passing input data to autocorpus tools, pipe it to uni2ascii first and then convert it back to Unicode with ascii2uni when you’re done. For example:

pv enwiki-20110620-pages-articles.xml | uni2ascii | wiki-articles | \
wiki-textify -h | ascii2uni > wikipedia-plaintext.txt

Limitations: Non-English Languages

Current release of autocorpus was designed for English corpora and your results with other languages may vary. In particular, the sentences and tokenize utilities may have trouble breaking sentences and cleaning up punctuation in languages other than English.

Author

Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com).

The project website is http://mpacula.com/autocorpus