What is AutoCorpus?

AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets such as Wikipedia. It consists of fast native binaries that can process tens of gigabytes of text in a matter of hours.

AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines. If you like this project and would like to support its development, please consider donating.
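Every tool is a stream filter: it reads text on standard input and writes results on standard output, so AutoCorpus stages compose with each other and with standard Unix utilities through ordinary pipes. A minimal sketch (corpus.txt is a placeholder and the default n-gram order is assumed; see each tool's manpage for the exact behavior):

    # Split raw text into sentences, break the sentences into tokens,
    # count n-grams, and rank the counts by frequency.
    cat corpus.txt | sentences | tokenize | ngrams | ngrams-sort > counts.txt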

What can I do with it?

AutoCorpus enables you to convert the entire English Wikipedia from a 30+ GB XML dump into a clean plaintext version and an n-gram language model in only a few hours on a laptop. With minimal extra work, you can also clean up and build language models for other text sources, such as Project Gutenberg.
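Concretely, that conversion is just two pipelines, sketched below. The dump file name, the n-gram order, and the frequency cutoff are illustrative assumptions; autocorpus(7) and the individual manpages document the exact invocations:

    # 1. Extract articles from the compressed XML dump and strip the
    #    wiki markup down to clean plaintext.
    bzcat enwiki-latest-pages-articles.xml.bz2 | wiki-articles | wiki-textify > wikipedia.txt

    # 2. Build a frequency-sorted trigram model, dropping rare n-grams
    #    to keep the model compact (the -n and -f options are assumed here).
    cat wikipedia.txt | sentences | tokenize | ngrams -n 3 | ngrams-sort | ngrams-freq-filter -f 5 > trigrams.txt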

How do I use it?

Each AutoCorpus tool has a manpage (see the list below). The autocorpus(7) manpage additionally shows how to combine the tools to accomplish common tasks, such as the pipelines sketched above.

ngrams-freq-filter manpage
ngrams-sort manpage
ngrams manpage
sentences manpage
tokenize manpage
wiki-articles manpage
wiki-textify manpage

Is AutoCorpus free software?

Yes. AutoCorpus is distributed under the terms of the Affero General Public License v3. The source code is available on GitHub at https://github.com/mpacula/AutoCorpus. You can also use GitHub to submit feature requests and bug reports.

If you would like to use AutoCorpus in a proprietary product, please contact the author to inquire about a commercial license.

Releases

The current stable version is 1.0.1 and was released on November 24, 2011.

Debian/Ubuntu:
    i386:   autocorpus_1.0.1-1_i386.deb
    amd64:  autocorpus_1.0.1-1_amd64.deb

Generic Linux:
    i386:   autocorpus-1.0.1-i686.tar.gz
    amd64:  autocorpus-1.0.1-x86_64.tar.gz

Source:
    all:    autocorpus-1.0.1.tar.gz

Download plaintext Wikipedia and n-grams

For your convenience, language models and a cleaned-up, plaintext version of Wikipedia are available for download. All the datasets were generated using the latest version of AutoCorpus. If a particular download is not available below, it is due to file size and hosting restrictions. Please contact me for a download link. When viewing the files, you might need to manually set your editor's encoding to UTF-8.

I can also ship you a USB stick with all the data (roughly 35 GB uncompressed) for a small fee. Again, shoot me an email if interested.

All Wikipedia datasets are distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Wikipedia plaintext:  sample, download
unigrams:             sample, download
bigrams:              sample, download
trigrams:             sample, download