As part of my graduate courses at the University of Washington, we are studying n-gram based language models. Such a model learns which groups of consecutive words occur in a corpus, and how often. If N is three, the model covers every group of 3 consecutive words found in the corpus.
Yesterday in class we were discussing the size of corpus required to train an n-gram language model. Our professor said that for a trigram model, perhaps a billion words would be sufficient, but for larger n-gram sizes as much as a trillion words would be required.
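To make the idea of "groups of N words" concrete, here is a minimal sketch in Python. The function name extract_ngrams and the toy corpus are my own illustrations, not something from the course, and real systems would use a proper tokenizer rather than splitting on whitespace.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus (made up for illustration); whitespace splitting stands in
# for real tokenization.
corpus = "the cat sat on the mat and the cat sat on the rug"
tokens = corpus.split()

# Count how often each trigram (N = 3) occurs.
trigram_counts = Counter(extract_ngrams(tokens, 3))

for trigram, count in trigram_counts.most_common(3):
    print(" ".join(trigram), count)
```

Even in this tiny example most trigrams occur only once, which hints at why the counts need a very large corpus before they become reliable.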
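Philosophically, I think this points out the limitations of n-gram training. A human has a total corpus that is much smaller than this. As a rough estimate, assume a person hears one word per second during 16 waking hours a day for 15 years:
15 yr * 365 d/yr * 16 hr/d * 3600 sec/hr * 1 word/sec = ~315 million words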
This number gets smaller if you realize that humans don’t have constant input for 16 hours a day. This number gets larger if you think that adults need more than 15 years to be fully functional in today’s world of specialized knowledge.
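The back-of-the-envelope estimate can be checked, and varied, with a few lines of Python. This is only a sketch: the function name words_heard is hypothetical, and the alternative inputs (4 hours of language input per day, 25 years of learning) are assumptions I picked to illustrate the two caveats, not measured values.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365

def words_heard(years, hours_per_day=16, words_per_second=1.0):
    """Rough count of words a person is exposed to over the given years."""
    return int(years * DAYS_PER_YEAR * hours_per_day
               * SECONDS_PER_HOUR * words_per_second)

# The estimate from the text: 15 years, 16 hours a day, 1 word per second.
baseline = words_heard(15)

# Assumed variation: only 4 hours of actual language input per day.
less_input = words_heard(15, hours_per_day=4)

# Assumed variation: 25 years to become fully functional.
more_years = words_heard(25)

print(f"{baseline:,}")    # 315,360,000
print(f"{less_input:,}")  # 78,840,000
print(f"{more_years:,}")  # 525,600,000
```

Under all three sets of assumptions the total stays below a billion words.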
Nevertheless, this is less than a billion words and certainly less than a trillion words. Having the ability to process input in a way that detects linguistic structure gives us humans an advantage over systems that can’t pick out that structure.