By default, tokenize discards text enclosed in parentheses; pass the --parens option to keep it.
All output is converted to lowercase.
echo "An Introduction to Natural Language Processing, \
Computational Linguistics, and Speech Recognition \
(a book by Jurafsky and Martin)" | tokenize
an introduction to natural language processing computational linguistics and speech recognition
echo "An Introduction to Natural Language Processing, \
Computational Linguistics, and Speech Recognition \
(a book by Jurafsky and Martin)" | tokenize --keep "," --parens
an introduction to natural language processing , \
computational linguistics , and speech recognition ( a book by jurafsky and martin )
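For readers without tokenize installed, the default behavior shown in the first example can be roughly approximated with standard tools. This is only a sketch inferred from the sample output, not how tokenize is implemented: the function name approx_tokenize is hypothetical, and it assumes that all punctuation is dropped by default and that parentheses are not nested.

```shell
# Hypothetical stand-in for tokenize's default mode (an assumption based on
# the sample output, not the tool's actual implementation).
approx_tokenize() {
  sed 's/([^)]*)//g' |            # drop text enclosed in (non-nested) parentheses
    tr '[:upper:]' '[:lower:]' |  # convert everything to lowercase
    tr -c '[:alnum:]\n' ' ' |     # turn punctuation and other symbols into spaces
    tr -s ' ' |                   # collapse runs of spaces
    sed 's/^ //; s/ $//'          # trim leading and trailing spaces
}

echo "Speech Recognition, (a book by Jurafsky and Martin)" | approx_tokenize
# prints: speech recognition
```

The sketch does not model --keep or --parens; it only mirrors the lowercasing and the removal of parenthesized text and punctuation seen in the first example.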
The project website is http://mpacula.com/autocorpus