Corpus size for n-gram training

Thursday, November 8th, 2007

As part of my graduate courses at the University of Washington, we are studying n-gram based language models.  An n-gram model learns which groups of N consecutive words occur in a corpus, and how often.  If N is three, the model covers every group of 3 consecutive words found in the corpus.
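
To make "groups of 3 words" concrete, here is a minimal Python sketch that extracts and counts trigrams. The toy corpus and the helper name `ngrams` are made up for illustration; a real model would also need proper tokenization and smoothing, which this skips.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus, invented for this example only.
corpus = "the cat sat on the mat and the cat sat on the rug"
tokens = corpus.split()  # naive whitespace tokenization

# Count how often each trigram (N = 3) occurs.
trigram_counts = Counter(ngrams(tokens, 3))

for trigram, count in trigram_counts.most_common(3):
    print(" ".join(trigram), count)
```

Even in this tiny corpus most trigrams show up only once, which hints at why so much text is needed before the counts become reliable.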

Yesterday in class we were discussing the size of corpus required to train an n-gram language model.  Our professor said that for a trigram model, perhaps a billion words would be sufficient, but for larger values of N as much as a trillion words would be required.

Philosophically, I think this points out the limitations of n-gram training.  A human learns language from a total corpus that is much smaller than this.

15 yr * 365 d/yr * 16 hr/d * 3600 sec/hr * 1 word/sec = ~315 million words

This number gets smaller if you realize that humans don’t have constant input for 16 hours a day.  This number gets larger if you think that adults need more than 15 years to be fully functional in today’s world of specialized knowledge.
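
The back-of-the-envelope estimate above is easy to play with in code. This is only a sketch: the function name `words_heard` and its defaults are mine, and the one-word-per-second rate is the same rough guess used in the calculation above, not a measured figure.

```python
def words_heard(years, hours_per_day=16, words_per_second=1.0, days_per_year=365):
    """Rough count of words a person is exposed to over a number of years."""
    seconds = years * days_per_year * hours_per_day * 3600
    return seconds * words_per_second

baseline = words_heard(15)                       # the estimate above
less_input = words_heard(15, hours_per_day=8)    # input for only half the waking day
more_years = words_heard(25)                     # a longer learning period

for label, value in [("baseline", baseline),
                     ("8 hr/day", less_input),
                     ("25 years", more_years)]:
    print(f"{label}: {value / 1e6:.0f} million words")
```

Under these assumptions every variation stays in the hundreds of millions of words.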

Nevertheless, this is fewer than a billion words and certainly fewer than a trillion.  Being able to process input in a way that detects linguistic structure gives us humans an advantage over systems that can’t pick out structure.