wiki-articles - extract articles from a Wikipedia XML database.
wiki-articles [-d directory]
The wiki-articles utility reads a Wikipedia
XML database of articles from standard input and extracts MediaWiki markup
for each article. The output is printed to standard output, or, if the -d
parameter is specified, to files in a directory.
wiki-articles can work
with very large databases (tens of gigabytes) without running out of memory.
If no output directory is specified, articles are printed to standard
output, delimited by form feeds (the C character '\f').
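The form-feed delimiter makes the standard-output stream easy to split back into articles. The following is a minimal sketch of one way a consumer might do that in Python; it is not part of autocorpus. The function name `split_articles` and the `chunk_size` parameter are hypothetical, and the sketch assumes the stream is text and that '\f' does not occur inside an article.

```python
from typing import Iterator, TextIO


def split_articles(stream: TextIO, chunk_size: int = 1 << 16) -> Iterator[str]:
    """Yield one article at a time from form-feed-delimited text.

    Hypothetical helper: reads the stream in chunks so that only the
    article currently being assembled is held in memory.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Every piece before the last form feed is a complete article;
        # the final piece may be incomplete, so carry it over.
        *complete, pending = pending.split("\f")
        for article in complete:
            if article.strip():  # skip empty or whitespace-only pieces
                yield article
    if pending.strip():
        # Text after the last form feed (or a stream with no delimiter).
        yield pending
```

If the output of wiki-articles is piped into a script that defines this function, calling `split_articles(sys.stdin)` would iterate over the articles one by one.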
- -d directory
- The output directory. If set, each article is saved to a separate file
in this directory, and the file is named after the article.
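When -d is used, the articles can be read back by walking the output directory. The sketch below shows one possible way to do this in Python. The helper name `read_articles` is hypothetical, and UTF-8 encoding of the files is an assumption, as is the expectation that the directory contains nothing but article files.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_articles(directory: str) -> Iterator[Tuple[str, str]]:
    """Yield (file name, markup) for every regular file in `directory`.

    Hypothetical helper: the file name is used as the article title,
    following the naming described above.
    """
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # UTF-8 is an assumption; undecodable bytes are replaced.
            yield path.name, path.read_text(encoding="utf-8", errors="replace")
```

Each file is read only when the loop reaches it, so memory use is bounded by the largest single article rather than by the whole directory.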
- Command:
cat ../wikipedia/data/enwiki-20110620-pages-articles.xml | wiki-articles
- Output (truncated):
{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{Good article}}
{{pp-move-indef}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]]
Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com).
The project website is http://mpacula.com/autocorpus
autocorpus(7), ngrams(1), ngrams(5), ngrams-freq-filter(1), ngrams-sort(1),
sentences(1), tokenize(1), wiki-textify(1)