<sum of counts>
C	word1 ... wordn
C	word1 ... wordn
...
where the words of an ngram are separated by spaces and the count C is separated from the ngram by a tab character. Each ngram is printed on its own line, and each ngram appears only once.
The special words <s> and </s> denote start and end of sentences, respectively. See ngrams(5) for details.
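As an illustration of the output format described above, the following Python sketch reads such a report back into a dictionary. It is not part of ngrams; the function name parse_ngrams_output and the sample text are hypothetical, and the sketch assumes the first non-empty line holds the sum of counts and every following line holds a count, a tab, and a space-separated ngram.

```python
from typing import Dict, Tuple


def parse_ngrams_output(text: str) -> Tuple[int, Dict[Tuple[str, ...], int]]:
    """Parse a report in the format described above (hypothetical helper).

    Assumes: line 1 is the sum of counts; each later line is
    "<count><TAB><word1> ... <wordn>".
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    total = int(lines[0])
    counts: Dict[Tuple[str, ...], int] = {}
    for ln in lines[1:]:
        # The count is separated from the ngram by the first tab.
        count_field, ngram_field = ln.split("\t", 1)
        # Words inside the ngram are separated by single spaces.
        counts[tuple(ngram_field.split(" "))] = int(count_field)
    return total, counts


# Sample report, written by hand to match the documented format.
sample = "11\n2\t<s> this\n1\ta test\n2\ttest </s>\n2\tthis is\n4\tother words\n"
total, counts = parse_ngrams_output(sample)

# The first line should equal the sum of the per-ngram counts.
print(total == sum(counts.values()))
```

Keeping each ngram as a tuple of words, rather than a single string, makes it easy to recognize the special words <s> and </s> at either end of an ngram.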
ngrams is designed to process very large files using constant memory, with the memory limit specified by the user. Although more memory is better, ngrams can process tens of gigabytes of data using only a few megabytes of RAM, because a disk cache is used for auxiliary storage.
The value of LIMIT does not affect the maximum size of datasets that ngrams can process: it is possible, though not recommended, to process gigabytes of data in a few megabytes of memory.
The actual memory used by ngrams will be equal to c*LIMIT, where c is a small system-dependent constant. Hence it is possible for ngrams to run out of memory even if LIMIT is less than the total amount of available RAM. If such a situation occurs, decreasing LIMIT should fix the issue.
echo -e "this is a test\nthis is yet another test" | ngrams -n 2
11
2	<s> this
1	a test
1	another test
1	is a
1	is yet
2	test </s>
2	this is
1	yet another
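The counts in this example can be checked by hand with a small in-memory Python sketch. It is not how ngrams works internally: it ignores LIMIT and the disk cache entirely. The function names are hypothetical, and the sketch assumes that each input line is one sentence, that a single <s> and </s> are added per sentence, and that ngrams are printed in sorted order, all of which are inferred from the example rather than stated by ngrams.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Count ngrams in memory (illustration only, no memory limit)."""
    counts: Counter = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue
        # Assumption: one sentence per line, with one marker at each end.
        tokens = ["<s>"] + words + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def format_counts(counts: Counter) -> str:
    """Render counts as: sum of counts, then "<count><TAB><ngram>" lines."""
    out = [str(sum(counts.values()))]
    for ngram in sorted(counts):  # assumption: sorted order, as in the example
        out.append(f"{counts[ngram]}\t{ngram}")
    return "\n".join(out)


report = format_counts(
    count_ngrams(["this is a test", "this is yet another test"], 2)
)
print(report)
```

The first sentence contributes 5 bigrams and the second contributes 6, which accounts for the total of 11 on the first line.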
The project website is http://mpacula.com/autocorpus