Name

tokenize - transform sentences into space-delimited words while ignoring punctuation.

Synopsis

tokenize [--keep CHARACTERS] [--parens]

Description

The tokenize utility transforms sentences into space-delimited words (tokens), ignoring punctuation. Sentences are read from standard input, one per line, and written to standard output, also one per line. Page feeds and paragraph breaks are preserved.

tokenize ignores text in parentheses unless otherwise specified (see the --parens option).

All output is downcased.
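
The default behavior described above can be sketched in Python. This is a hypothetical re-implementation for illustration only, not the actual Autocorpus code, and the real utility may differ in edge cases:

```python
import re

def tokenize_line(line):
    """Sketch of the default behavior: drop parenthesized text,
    strip punctuation, and downcase the remaining words."""
    line = re.sub(r"\([^)]*\)", " ", line)       # ignore text in parentheses
    words = re.findall(r"[A-Za-z0-9']+", line)   # keep word characters only
    return " ".join(w.lower() for w in words)

print(tokenize_line(
    "An Introduction to Natural Language Processing, "
    "Computational Linguistics, and Speech Recognition "
    "(a book by Jurafsky and Martin)"))
# prints: an introduction to natural language processing
#         computational linguistics and speech recognition
```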

Options

--keep CHARACTERS
specifies punctuation characters that should not be omitted. Each character given in this argument appears as a separate token in the output.

--parens
if present, text inside parentheses is not ignored, and the parentheses themselves appear as separate tokens in the output.
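
The effect of these two options can be sketched in Python as well. Again, this is an illustrative approximation of the semantics described here, not the actual Autocorpus implementation:

```python
import re

def tokenize_line(line, keep="", parens=False):
    """Sketch of --keep and --parens: characters listed in `keep`
    (and parentheses, when parens=True) become separate tokens."""
    if not parens:
        line = re.sub(r"\([^)]*\)", " ", line)   # default: drop parenthesized text
    specials = set(keep) | ({"(", ")"} if parens else set())
    tokens = []
    # Match either a run of word characters or a single punctuation character.
    for tok in re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", line):
        if tok[0].isalnum() or tok[0] == "'":
            tokens.append(tok.lower())
        elif tok in specials:
            tokens.append(tok)
    return " ".join(tokens)

print(tokenize_line(
    "An Introduction to Natural Language Processing, "
    "Computational Linguistics, and Speech Recognition "
    "(a book by Jurafsky and Martin)",
    keep=",", parens=True))
# prints: an introduction to natural language processing ,
#         computational linguistics , and speech recognition
#         ( a book by jurafsky and martin )
```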

Examples

Command:

echo "An Introduction to Natural Language Processing, \
Computational Linguistics, and Speech Recognition \
(a book by Jurafsky and Martin)" | tokenize 
Output:

an introduction to natural language processing computational linguistics
and speech recognition 

Command:

echo "An Introduction to Natural Language Processing, \
Computational Linguistics, and Speech Recognition \
(a book by Jurafsky and Martin)" | tokenize --keep "," --parens
Output:

an introduction to natural language processing ,
computational linguistics , and speech recognition ( a book by jurafsky
and martin )

Author

Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com).

The project website is http://mpacula.com/autocorpus

See Also

autocorpus(7), ngrams(1), ngrams(5), ngrams-freq-filter(1), ngrams-sort(1), sentences(1), wiki-articles(1), wiki-textify(1)
