Synopsis

This document describes the ngram format used by the autocorpus utilities.

Description

Autocorpus utilities that read and write ngrams, such as ngrams, ngrams-freq-filter and ngrams-sort, use the following input/output format:
<sum of counts>
C    word1 ... wordn
C    word1 ... wordn
      ...

where the words of an ngram are separated by spaces and the count C is separated from the words by a tab character. Each ngram is printed on its own line, and every ngram appears only once. The first line is always the sum of all ngram counts.
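
As an illustration, the format can be read back with a short Python sketch. The function name read_ngrams is hypothetical and not part of autocorpus, and the sketch assumes that counts are non-negative integers, which this page does not state explicitly. It also checks the two properties described above: ngrams are unique, and the first line equals the sum of the counts.

```python
import io
from typing import Dict, TextIO, Tuple


def read_ngrams(stream: TextIO) -> Tuple[int, Dict[Tuple[str, ...], int]]:
    """Parse the ngram format: a total line, then 'count<TAB>words' lines.

    Hypothetical helper; assumes counts are integers.
    """
    # The first line holds the sum of all ngram counts.
    total = int(stream.readline().strip())

    counts: Dict[Tuple[str, ...], int] = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a trailing blank line
        # The count is separated from the words by a tab character.
        count_text, _, words_text = line.partition("\t")
        # Words within an ngram are separated by spaces.
        ngram = tuple(words_text.split(" "))
        if ngram in counts:
            raise ValueError(f"duplicate ngram: {words_text!r}")
        counts[ngram] = int(count_text)

    if sum(counts.values()) != total:
        raise ValueError("first line does not match the sum of counts")
    return total, counts


sample = "5\n3\t<s> the cat\n2\tthe cat </s>\n"
total, counts = read_ngrams(io.StringIO(sample))
```

Splitting on the first tab only keeps the parser tolerant of any further tabs inside the word field, although the format as described should not contain them.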

The special words <s> and </s> denote start and end of sentences, respectively.
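
To show how the sentence markers appear in practice, the following Python sketch counts ngrams over tokenized sentences and prints them in the format described on this page. The function format_ngrams is hypothetical, and padding each sentence with exactly one <s> and one </s> is an assumption; the padding used by ngrams(1) itself may differ.

```python
from collections import Counter
from typing import Iterable, List


def format_ngrams(sentences: Iterable[List[str]], n: int) -> str:
    """Count ngrams of order n and render them in the autocorpus ngram format.

    Hypothetical helper; assumes one <s> and one </s> per sentence.
    """
    counts: Counter = Counter()
    for words in sentences:
        # Mark the start and end of the sentence with the special words.
        padded = ["<s>"] + list(words) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1

    # First line: the sum of all ngram counts.
    lines = [str(sum(counts.values()))]
    # One unique ngram per line: count, tab, space-separated words.
    for ngram, count in counts.items():
        lines.append(f"{count}\t{' '.join(ngram)}")
    return "\n".join(lines) + "\n"


output = format_ngrams([["the", "cat"], ["the", "dog"]], 2)
print(output, end="")
```

Because the counts are accumulated in a single mapping, each ngram is emitted once, which satisfies the uniqueness requirement of the format.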

See Also

ngrams(1), ngrams-freq-filter(1), ngrams-sort(1)
