What is AutoCorpus?

AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets such as Wikipedia. It consists of fast native binaries that can process tens of gigabytes of text in a matter of hours.

AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines. If you like this project and would like to support its development, please consider donating.
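Every tool is a stream filter: it reads text on standard input and writes results on standard output, so AutoCorpus stages compose with each other and with standard Unix utilities through ordinary pipes. A minimal sketch (corpus.txt is a placeholder and the default n-gram order is assumed; see each tool's manpage for the exact behavior):

    # Split raw text into sentences, break the sentences into tokens,
    # count n-grams, and rank the counts by frequency.
    cat corpus.txt | sentences | tokenize | ngrams | ngrams-sort > counts.txt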

What can I do with it?

AutoCorpus enables you to convert the entire English Wikipedia from a 30+ GB XML dump into a clean plaintext version and an n-gram language model in only a few hours on a laptop. With minimal extra work, you can also clean up and build language models for other text sources, such as Project Gutenberg.
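Concretely, that conversion is just two pipelines, sketched below. The dump file name, the n-gram order, and the frequency cutoff are illustrative assumptions; autocorpus(7) and the individual manpages document the exact invocations:

    # 1. Extract articles from the compressed XML dump and strip the
    #    wiki markup down to clean plaintext.
    bzcat enwiki-latest-pages-articles.xml.bz2 | wiki-articles | wiki-textify > wikipedia.txt

    # 2. Build a frequency-sorted trigram model, dropping rare n-grams
    #    to keep the model compact (the -n and -f options are assumed here).
    cat wikipedia.txt | sentences | tokenize | ngrams -n 3 | ngrams-sort | ngrams-freq-filter -f 5 > trigrams.txt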

How do I use it?

Each AutoCorpus tool has a manpage (see the list below). The autocorpus(7) manpage additionally shows how to combine the tools to accomplish common tasks, such as the pipelines sketched above.

ngrams-freq-filter manpage
ngrams-sort manpage
ngrams manpage
sentences manpage
tokenize manpage
wiki-articles manpage
wiki-textify manpage

Is AutoCorpus free software?

Yes. AutoCorpus is distributed under the terms of the Affero General Public License v3. The source code is available on GitHub at https://github.com/mpacula/AutoCorpus. You can also use GitHub to submit feature requests and bug reports.

If you would like to use AutoCorpus in a proprietary product, please contact the author to inquire about a commercial license.

Releases

The current stable version is 1.0.1 and was released on November 24, 2011.

Debian/Ubuntu:
    i386:   autocorpus_1.0.1-1_i386.deb
    amd64:  autocorpus_1.0.1-1_amd64.deb

Generic Linux:
    i386:   autocorpus-1.0.1-i686.tar.gz
    amd64:  autocorpus-1.0.1-x86_64.tar.gz

Source:
    all:    autocorpus-1.0.1.tar.gz

Download plaintext Wikipedia and n-grams

For your convenience, language models and a cleaned-up, plaintext version of Wikipedia are available for download. All the datasets were generated using the latest version of AutoCorpus. If a particular download is not available below, it is due to file size and hosting restrictions. Please contact me for a download link. When viewing the files, you might need to manually set your editor's encoding to UTF-8.

I can also ship you a USB stick with all the data (roughly 35 GB uncompressed) for a small fee. Again, shoot me an email if interested.

All Wikipedia datasets are distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Wikipedia plaintext:  sample, download
unigrams:             sample, download
bigrams:              sample, download
trigrams:             sample, download