Archive for the ‘Questions to Ponder’ Category

Corpus size for n-gram training

Thursday, November 8th, 2007

As part of my graduate courses at the University of Washington, we are studying n-gram based language models.  An n-gram model learns which groups of N consecutive words occur in a corpus, and how often.  If N is three, the model is built from every group of 3 consecutive words found in the corpus.
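To make the idea concrete, here is a minimal Python sketch that collects and counts the trigrams in a toy sentence.  The function name `ngrams` and the sample text are my own illustration, not something from the course, and a real model would be trained on a vastly larger corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus, invented for illustration only.
tokens = "the dog saw the cat and the dog saw the bird".split()

# Count how often each trigram (n = 3) occurs.
trigram_counts = Counter(ngrams(tokens, 3))

for trigram, count in trigram_counts.items():
    print(" ".join(trigram), count)
```

Even in this tiny example most trigrams occur only once, which hints at why so much text is needed before the counts become reliable.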

Yesterday in class we were discussing the size of corpus required to train an n-gram language model.  Our professor said that for a trigram model, perhaps a billion words would be sufficient, but for larger n-gram sizes as much as a trillion words would be required.

Philosophically, I think this points out the limitations of n-gram training.  A human has a total corpus that is much smaller than this.

15 yr * 365 d/yr * 16 hr/d * 3600 sec/hr * 1 word/sec = ~315 million words

This number gets smaller if you realize that humans don’t have constant input for 16 hours a day.  This number gets larger if you think that adults need more than 15 years to be fully functional in today’s world of specialized knowledge.
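The estimate is easy to play with in code.  This is a small sketch of the same arithmetic; the function name `words_heard` and the alternative values (8 hours a day, 25 years) are assumptions I picked just to show how the number moves, not measured figures.

```python
def words_heard(years=15, hours_per_day=16, words_per_second=1.0):
    """Rough count of words heard, using the same back-of-envelope formula."""
    seconds = years * 365 * hours_per_day * 3600
    return seconds * words_per_second

# The baseline from the estimate above.
print(f"baseline:        {words_heard():,.0f}")

# Assumed variation: only 8 hours of language input per day.
print(f"8 hours per day: {words_heard(hours_per_day=8):,.0f}")

# Assumed variation: 25 years instead of 15.
print(f"25 years:        {words_heard(years=25):,.0f}")
```

With any of these settings the total stays below a billion words.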

Nevertheless, this is less than a billion words and certainly less than a trillion words.  Having the ability to process input in a way that detects linguistic structure gives us humans an advantage over systems that can’t pick out structure.

Recognizing Entailment.

Friday, October 6th, 2006

This is a link to the Second Recognizing Textual Entailment Challenge.

http://www.pascal-network.org/Challenges/RTE2

Entailment holds when the truth of one statement guarantees the truth of another statement.  Here is an example from O’Grady (2005).

The park wardens killed the bear. ==> The bear is dead.

If it is true that the park wardens killed the bear, then it must also be true that the bear is dead.

This challenge involves a list of sentence pairs.  The object is to write a program that can separate the pairs where one sentence entails the other from the pairs that do not exhibit entailment.  See the link for the samples as well as links to the challenge entries.

After I saw this challenge, I started thinking about applications for the ability to perform this entailment test.  For example, imagine a web search that returns a list of pages.  From each of the pages, take the sentences that contain the words in the original search.  If the search was for ‘republican crisis’, then each sentence returned would contain those words.  From this list of sentences, we perform the entailment test.  If there are 1000 sentences, then we are comparing roughly 500,000 sentence pairs (499,500 to be exact, and twice as many tests if each direction is checked separately).  This is probably not very efficient, but ignore that for now.

After the sentence comparisons, we can sort based on which sentences entailed which ones.  This will create groups of sentences where entailment was detected either one way or the other between each pair of sentences.  Now that the search pages are grouped according to this entailment test, let the user pick which group to wade into first.
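Here is a rough Python sketch of that grouping step.  Everything in it is an assumption made for illustration: `toy_entails` is a crude word-overlap placeholder, not a real entailment detector, and the grouping is simplified so that two sentences land in the same group whenever a chain of entailments connects them, which is looser than requiring entailment between every pair in the group.

```python
from itertools import combinations

def toy_entails(premise, hypothesis):
    """Placeholder, NOT a real entailment test: true when every word
    of the hypothesis also appears in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())

def group_by_entailment(sentences, entails=toy_entails):
    """Group sentences that are linked by entailment in either direction."""
    parent = list(range(len(sentences)))  # union-find forest

    def find(i):
        # Follow parent links to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Many-to-many comparison: every unordered pair, tested both ways.
    for i, j in combinations(range(len(sentences)), 2):
        if entails(sentences[i], sentences[j]) or entails(sentences[j], sentences[i]):
            parent[find(i)] = find(j)

    groups = {}
    for i, sentence in enumerate(sentences):
        groups.setdefault(find(i), []).append(sentence)
    return list(groups.values())

# Invented example sentences.
sentences = [
    "republicans caused the budget crisis",
    "republicans caused the crisis",
    "republicans offered relief after the crisis",
    "republicans offered relief",
]

for group in group_by_entailment(sentences):
    print(group)
```

Swapping in a real entailment detector for `toy_entails` would leave the grouping code unchanged; the pairwise loop is the expensive part.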

In other words, let the entailment test group the pages and then let the user pick the page groups accordingly.  Continuing our ‘republican crisis’ example, maybe this entailment sorting method would be able to group pages that talk about republicans that have caused their own crisis and pages that talk about a crisis that republicans were working to provide relief for.

Anyway, this is just a question to ponder at this point.  This entailment challenge was only the second time around, and there is still much to be learned in order to improve the accuracy of the entailment detection algorithms.  And certainly, my suggestion of doing a many-to-many comparison between all of the pages returned from a search is not very practical.  But the idea of being able to group the pages according to this entailment criterion is nevertheless very intriguing.

 

 

How do children do it?

Sunday, February 5th, 2006

Just how is it that a child can do what cannot yet be done on a computer? 

Here is a timetable for an average child learning its first language:

  • Before 12 months – Understand single words.
  • After 12 months – Produce single words.
  • 18 months – Learn a new word every two hours (on average).  Combine words into pairs – most of the time the words are in the correct order.
  • 24-36 months – Start understanding and using complex sentences.

Source: The Language Instinct, Steven Pinker

Children are capable of learning the language of their parents without any overt training.  Even if the child is totally ignored and never spoken to directly, he/she will still learn to understand and speak.

As Steven Pinker and many other people have expressed, this learning ability probably implies the existence of structures in the brain that are prepared for this steep learning curve.  By the age of 12 months, a child’s brain is prepped to begin associating words with things, people and actions.  Something in the brain is searching for recurring patterns in words heard and the things meant by those words.

Clearly, a child learns about its environment before it learns words for it.  Children recognize their caretakers, they learn routines, they gain knowledge of objects.  They learn how they are constrained by the laws of physics; for example, they learn that objects fall when dropped.

Is learning these environmental items a prerequisite for learning a language?  Before a child can learn the word ball, is it a requirement to have previously experienced a ball?  Is there a knowledge structure in the brain that represents the ball in the child’s thoughts that is underlying the ability to speak the word ball and eventually to express ideas such as ‘drop the ball’?

 

Searching For Relationships

Saturday, February 4th, 2006

What if web searches used the relationships between words as part of the search criteria?

A web search typically treats the search request as a bag of words.  When you search for ‘yard debris’, you will get essentially the same results as for ‘debris yard’.  Google returns about 1.7 million hits in either case.

Searching for ‘blog review’ or ‘review blog’ returns the same basic results of about 95 million hits.
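A small Python sketch shows why word order is lost.  The function `bag_of_words` is my own illustration of the general idea, not how any particular search engine actually works.

```python
from collections import Counter

def bag_of_words(query):
    """Reduce a query to its words and their counts, discarding order."""
    return Counter(query.lower().split())

# As bags of words the two queries are identical.
print(bag_of_words("yard debris") == bag_of_words("debris yard"))

# As ordered sequences they are different.
print("yard debris".split() == "debris yard".split())
```

The first comparison is true and the second is false: the order of the words, and with it the relationship between them, is exactly what the bag throws away.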

As always, I am trying to find ways to improve how computers work for us.  Imagine asking the computer, “Where do I take my yard debris?”  Ideally, the result would be 3 to 5 hits for locations close to your home that can take your leaves and branches.  (We are having a wind storm today, so I’ll be picking up the yard debris tomorrow.)

What would it take for the search to return 5 pages, instead of 1.7 million pages?

Could word clusters help with this improvement?  There are many research groups looking at word clusters for ways to extract semantic information.  An example of a word cluster is “push the spring”.  This is a sentence fragment that has spring as the object of the verb push.  The word cluster in this case could be:

‘spring’->object_of->‘push’

Imagine all of the verbs that could have ‘spring’ (in the sense of a coil) as an object.  Now, imagine all of the verbs that could have ‘spring’ (in the sense of a season) as an object.  These two lists of verbs will be different.  There will be some overlap, but there will also be many verbs that are unique to each sense.
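The two verb lists can be sketched in a few lines of Python.  The (verb, object, sense) triples below are hand-made and invented purely for illustration; a real system would have to extract them from parsed text and would not be handed the sense labels.

```python
from collections import defaultdict

# Invented (verb, object, sense) triples; the sense labels are assumed.
triples = [
    ("push", "spring", "coil"),
    ("compress", "spring", "coil"),
    ("replace", "spring", "coil"),
    ("see", "spring", "coil"),
    ("enjoy", "spring", "season"),
    ("await", "spring", "season"),
    ("see", "spring", "season"),
]

# Collect the verbs that take each sense of 'spring' as their object.
verbs_by_sense = defaultdict(set)
for verb, obj, sense in triples:
    if obj == "spring":
        verbs_by_sense[sense].add(verb)

coil, season = verbs_by_sense["coil"], verbs_by_sense["season"]
print("shared:     ", sorted(coil & season))
print("coil only:  ", sorted(coil - season))
print("season only:", sorted(season - coil))
```

The verbs that appear in only one list are the ones that could help pick out which sense of the word is meant.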

These two distinct lists of verbs that select between different senses of ‘spring’ are an example of how semantics might be used to improve how computers interact with people.  Many researchers are digging into various facets of word clustering and semantic relationships.  Here are a few references.

Gamallo, Agustini, Lopes.  2005.  Clustering Syntactic Positions with Similar Semantic Requirements.  Computational Linguistics, Vol. 31, No. 1, pp. 107-145.

Lin.  1998.  Automatic Retrieval and Clustering of Similar Words.  COLING-ACL’98, pp. 768-774, Montreal.
http://www.cs.ualberta.ca/~lindek/papers/acl98.pdf

Green, Dorr, Resnik.  2004.  Inducing Frame Semantic Verb Classes from WordNet and LDOCE.  Proceedings of the Association for Computational Linguistics, Barcelona, Spain.
ftp://ftp.umiacs.umd.edu/pub/bonnie/green-dorr-resnik.pdf

Let me hear your comments on this subject.