Name

tokenize - transform sentences into space-delimited words while ignoring punctuation.

Synopsis

tokenize [--keep CHARACTERS] [--parens]

Description

The tokenize utility transforms sentences into space-delimited words (tokens), ignoring punctuation. Sentences are read from standard input, one per line, and written to standard output, also one per line. Page feeds and paragraph breaks are preserved.

tokenize ignores text in parentheses unless otherwise specified (see the --parens option).

All output is downcased.
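
The default behavior described above can be sketched in Python. This is a hypothetical re-implementation for illustration only, not the actual Autocorpus code, and the real utility may differ in edge cases:

```python
import re

def tokenize_line(line):
    """Sketch of the default behavior: drop parenthesized text,
    strip punctuation, and downcase the remaining words."""
    line = re.sub(r"\([^)]*\)", " ", line)       # ignore text in parentheses
    words = re.findall(r"[A-Za-z0-9']+", line)   # keep word characters only
    return " ".join(w.lower() for w in words)

print(tokenize_line(
    "An Introduction to Natural Language Processing, "
    "Computational Linguistics, and Speech Recognition "
    "(a book by Jurafsky and Martin)"))
# prints: an introduction to natural language processing
#         computational linguistics and speech recognition
```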

Options

--keep CHARACTERS
specifies punctuation characters that should not be omitted. Each character given in this argument appears as a separate token in the output.

--parens
if present, text inside parentheses is not ignored, and the parentheses themselves appear as separate tokens in the output.
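
The effect of these two options can be sketched in Python as well. Again, this is an illustrative approximation of the semantics described here, not the actual Autocorpus implementation:

```python
import re

def tokenize_line(line, keep="", parens=False):
    """Sketch of --keep and --parens: characters listed in `keep`
    (and parentheses, when parens=True) become separate tokens."""
    if not parens:
        line = re.sub(r"\([^)]*\)", " ", line)   # default: drop parenthesized text
    specials = set(keep) | ({"(", ")"} if parens else set())
    tokens = []
    # Match either a run of word characters or a single punctuation character.
    for tok in re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", line):
        if tok[0].isalnum() or tok[0] == "'":
            tokens.append(tok.lower())
        elif tok in specials:
            tokens.append(tok)
    return " ".join(tokens)

print(tokenize_line(
    "An Introduction to Natural Language Processing, "
    "Computational Linguistics, and Speech Recognition "
    "(a book by Jurafsky and Martin)",
    keep=",", parens=True))
# prints: an introduction to natural language processing ,
#         computational linguistics , and speech recognition
#         ( a book by jurafsky and martin )
```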

Examples

Command:

echo "An Introduction to Natural Language Processing, \
Computational Linguistics, and Speech Recognition \
(a book by Jurafsky and Martin)" | tokenize 
Output:

an introduction to natural language processing computational linguistics
and speech recognition 

Command:

echo "An Introduction to Natural Language Processing, \
Computational Linguistics, and Speech Recognition \
(a book by Jurafsky and Martin)" | tokenize --keep "," --parens
Output:

an introduction to natural language processing ,
computational linguistics , and speech recognition ( a book by jurafsky
and martin )

Author

Autocorpus was written by Maciej Pacula (maciej.pacula@gmail.com).

The project website is http://mpacula.com/autocorpus

See Also

autocorpus(7), ngrams(1), ngrams(5), ngrams-freq-filter(1), ngrams-sort(1), sentences(1), wiki-articles(1), wiki-textify(1)
