Form Before Function

August 22nd, 2013

I have been re-reading Child Language Acquisition and Growth (Barbara Lust, Cambridge University Press, 2006).

From page 217:

By the time children are about to produce first words and simple sentences … they have realized the basis of the skeleton of the sentence … even though they still do not understand the meaning of language. (My reduction of a larger paragraph).

This is a profound statement with strong implications for language theory. It means that a child learns how a sentence is put together before knowing what it means. When an infant lying in a crib watches its parents talk, the sentences are being parsed into their phrase constituents even though the meanings of the words are not understood.

The $64,000 question is: how does a child do this?  I am convinced that the way it is done is pretty simple.  It is a process that can be performed without cognitive intervention; the child doesn't think about it while it is happening.

Maybe someday we will create a model that performs the same feat!


High Speed Incremental Parsing

June 27th, 2013

I have been developing a new technique for parsing natural language text.  The result is a parser that is many times faster than existing techniques.  For example, I found this paper, Deep Parsing In Watson, which reports speeds of about 5,000 words per second.  The incremental parser reaches speeds of around 48,000 words per second.

This new parser is based on an incremental approach to parsing.  It processes each word in sequence, and for each word it makes a set of decisions.  Once the parser is done with a given word, those decisions are never changed.
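The loop described above can be sketched in a few lines. This is only an illustration of the incremental idea: the attachment rule used here (link each word to its predecessor) is a placeholder of my own, not the actual decision procedure of the new parser.

```python
# Sketch of incremental parsing: each word is processed exactly once, a
# decision is committed for it, and earlier decisions are never revisited.
# The decision rule (attach to the previous word) is a stand-in, not the
# real algorithm.

def parse_incrementally(words):
    decisions = []
    for i, word in enumerate(words):
        # Decide how this word attaches, using only words seen so far.
        head = i - 1 if i > 0 else None  # placeholder rule
        decisions.append((word, head))
        # decisions[0..i] are now frozen; later words cannot change them.
    return decisions

print(parse_incrementally(["the", "cat", "sleeps"]))
```

Because no decision is ever revisited, the work per word is constant, which is where the speed comes from.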

Here is a link to a page with more details about this new approach to parsing: Ridgefield Parser


December 9th, 2011

As part of debugging the text parser I am working on, I have been delving into the phrase structures in the Penn Treebank for the utterances from the Wall Street Journal.

This is an example of the phrase 'more than a xxx' as labeled in the Treebank.  In the ~40,000 utterances, the phrase is found 13 times: 8 times with the flat structure (NP, noun phrase) and 5 times with the branching structure (QP, quantifier phrase).

The phrase is found with different nouns at the end: 'quarter', 'year', and 'decade'.  'Year' is found 6 times (2 branching, 4 flat), 'decade' is found 6 times (3 each, branching and flat), and 'quarter' is found only once, in the flat structure.  It is interesting how 'year' and 'decade' are split between the two structures.

I haven’t yet found any clues in the sentences that contain these phrases as to why the individual instances are labeled differently.  Maybe someday, when I have time to read the literature, I will be able to find the reason for this.

Phrase Rules and Zipf’s Law

December 6th, 2011

I have been working with the Penn Treebank for about 4 years.  It is a list of sentences from the Wall Street Journal that have been manually tagged with part-of-speech tags and phrase/sentence structure.  Only now have I understood one of the primary issues with the way language works, as reflected by this corpus: the phrase structures in the corpus are Zipfian.

It is easier to understand the Zipfian nature of language by studying the words in a text corpus. Word occurrences are well known to have a Zipfian distribution. Some words occur very often. For example, ‘the’, ‘of’, and ‘and’ occur many times in the WSJ corpus. In the first 10,895 sentences of the corpus, these words occur 13,542, 6,390, and 4,696 times respectively. Then there is a long list of words that occur fewer times than this, but still more than once. And then there is a long list of words that occur only once.

  • 10,895 lines in this section of the corpus
  • 233,815 total words
  • 19,570 distinct words
  • 11,720 words that occur only once

More than half of the words are only found once. Nothing new here. This is covered in all introductions to text processing.
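The word-level statistics above are easy to reproduce on any tokenized corpus. A minimal sketch, using a tiny made-up token list rather than the WSJ section itself:

```python
from collections import Counter

def zipf_stats(tokens):
    """Return (total words, distinct words, words occurring only once)."""
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(tokens), len(counts), singletons

# Tiny illustrative corpus, not the WSJ data.
tokens = "the cat sat on the mat and the dog ran".split()
print(zipf_stats(tokens))  # (10, 8, 7): most distinct words occur once
```

Even in this toy example the familiar shape appears: one frequent function word and a long tail of singletons.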

Now we move to phrase structure rules. By phrase structure rules, I mean the rules that define how words can be put together to create a phrase. For example, ‘the cat’ is a noun phrase. The rule for this phrase says that a noun phrase is made up of a determiner (the) followed by a noun (cat). In language, there are a variety of phrase types: noun phrase, verb phrase, prepositional phrase, adjective phrase, and of course sentences. The entire list of phrase types used in the Penn Treebank corpus can be found at this link: PennTags

There are many different rules for noun phrases. Here are some examples.

  • Determiner followed by a noun (dt nn): “The cat”
  • Determiner, adjective, noun (dt jj nn): “The fat cat”
  • Cardinal number, plural noun (cd nns): “Three cats”

In the Penn Treebank corpus, there are about 2,685 distinct rules for noun phrases (depending on how you count).

What I didn’t understand or internalize until now is that phrase structure rules are also Zipfian. Of the 2,685 rules for noun phrases, the rule ‘dt nn’ is found 8,166 times and ‘np pp’ is found 7,607 times. There are many rules with fewer occurrences, and then there are 1,458 rules with only one occurrence. Across all rule types (nouns, verbs, adjectives, etc.) there are 6,797 rules, and 3,767 of them are used only once: more than half of the rules. When I found this, I said to myself, “that is a lot!” So I dug further: 3,003 sentences out of 10,895 have a constituent branch that uses one of these single-instance rules.  Almost one-third of the sentences have at least one phrase with a unique rule.
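Counting rule occurrences like this amounts to walking each tree and tallying every expansion. Here is a minimal sketch over two made-up toy trees (not actual WSJ annotations), with trees represented as (label, children) tuples where a bare string is a POS tag:

```python
from collections import Counter

def collect_rules(node, acc):
    """Count every expansion in a tree, e.g. ('np', ('dt', 'nn'))."""
    label, children = node
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    acc[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            collect_rules(c, acc)  # recurse into sub-phrases

trees = [
    ("s", (("np", ("dt", "nn")), ("vp", ("vbz",)))),        # "The cat sleeps"
    ("s", (("np", ("dt", "jj", "nn")), ("vp", ("vbz",)))),  # "The fat cat sleeps"
]
acc = Counter()
for t in trees:
    collect_rules(t, acc)
singletons = sum(1 for c in acc.values() if c == 1)
print(acc[("np", ("dt", "nn"))], singletons)  # two np rules occur only once
```

Run over the real treebank, the same tally produces the rule counts and the long singleton tail described above.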

A typical test for a parser is to train it on one part of the corpus and test it on a separate part. This means we learn the available phrase rules from the training section and then apply those rules, and only those rules, to the test section. But given what we just found, about one-third of any group of sentences we use for testing will have phrase rules that aren’t found anywhere else. So one-third of the test sentences can never be parsed correctly by a parser applied in this manner.

The Power of the Human Mind

February 22nd, 2010

I received this text in an email.  It pretty much says it all.

i cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it dseno’t mtaetr in waht oerdr the ltteres in a wrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it whotuit a pboerlm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Azanmig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt!

More Ideas from Yoav Seginer

February 6th, 2010

Yoav Seginer wrote his dissertation on the Incremental Parser.  The paper is pretty easy to read – accessible.  The introduction is an especially well written introduction to unsupervised grammar induction.

I was surprised to read what he had to say about substitutability.  Substitutability is the capacity to replace phrases with other phrases of the same type.  For example, ‘the dog’ in ‘the dog ran to town’ can be replaced with ‘it’.  So in some sense, the phrases ‘the dog’ and ‘it’ can be substituted for each other.  This is one of the cornerstones of linguistic theory and is used as a basis of many parsing techniques.  A PCFG parser uses probabilities over the phrase types that can be substituted in a given context.

However, Seginer makes the claim that substitutability is not required for his incremental parser.

Substitutability, the essential idea of the Harris method, which has been seen as a starting point for the induction process for so long, turns out to be unnecessary in unsupervised parsing.

 …unlabeled parsing which only requires the parser to identify the constituents (or dependency links) but does not require them to be labeled, is purely syntagmatic (by definition).   A parser induction algorithm can therefore focus on learning to detect syntactic units while ignoring substitutability. (p20)

Another fresh idea (to me) from Seginer’s paper is the skewness of language structure.

The syntactic structure of natural language is skewed. This simply means that when the syntactic structure of an utterance is represented by a tree, each node in the tree has at least one short branch. The shorter the shortest branch is, the greater the skewness.  (p22)

Essentially, the incremental parser takes advantage of skewness by expecting skewness in the parse result.  This reduces the search space and thereby makes the parsing process more efficient.

…context free grammars, allow (a-priori) any tree structure and, therefore, a learning algorithm for such representations must discover by itself the skewness property of syntactic trees. However, if this property is indeed universal, there is no need to burden the learning algorithm with its discovery and it is possible to code skewness directly into the parser.  (p23)

He claims that ‘coding the skewness’ into the syntactic representation and the parser, i.e., expecting branches to be of mixed depths, does not detract from the accuracy of the parse result.
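The skewness property is easy to check mechanically: at every internal node, look at the depth of the shallowest branch. A minimal sketch, with trees as nested lists (strings are words) and a hypothetical bracketing of the example sentence:

```python
# Check Seginer's skewness property: record, for each internal node,
# the depth of its shallowest branch.

def shortest_branch_depths(tree, acc):
    if isinstance(tree, str):
        return 0                 # a word has depth 0
    depths = [shortest_branch_depths(child, acc) for child in tree]
    acc.append(min(depths))      # this node's shortest branch
    return 1 + max(depths)

# Right-branching tree for "I know the boy sleeps" (hypothetical bracketing).
tree = ["I", ["know", [["the", "boy"], "sleeps"]]]
acc = []
shortest_branch_depths(tree, acc)
print(acc)  # [0, 0, 0, 0]: every node has a word as a direct child
```

Here every node's shortest branch has depth 0, the maximally skewed case; a parser that expects this shape never has to consider deeply balanced bracketings.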

Here is a link to Yoav’s dissertation.  Dissertation

Since graduating in 2007, it looks like Yoav Seginer is working at a small company in Amsterdam, Mondria Technologies Ltd.  The company website doesn’t say anything yet.  I wonder if they are working on a project that uses the incremental parser.

Incremental Unsupervised Grammar Learning

February 5th, 2010

This paper by Yoav Seginer is very exciting.  It covers a method of learning a language grammar that resonates with my mental model of how a young child learns language.  These are some aspects of the Seginer algorithm.

Incremental Learning.  The system adds information to the grammar with each new sentence.  In other approaches (CCM by Manning and Klein, UDOP by Bod), the entire corpus is processed as a block to collect the parameters of the grammar, and the entire corpus is repeatedly processed until the learning converges on a final result.  In Seginer’s approach, the corpus is processed one sentence at a time.  After each sentence, the grammar weights are updated.  That sentence is not revisited for further training.

Simpler Math.  The approach by Seginer uses much simpler math for computing the parameters of the resulting grammar.  There are a few ratios, some accumulation of values into others, and some comparisons of weights to choose which one to apply for a given step.  There are no long chains of probabilities to compute the best parse.  I don’t have any principled reason why this makes more sense as a model of how the brain works; it just fits my gut feeling better.

Not Restricted to Binary Trees.  The CCM approach and the other approaches referenced in Seginer’s paper all give binary parse trees as the result.  But natural languages aren’t limited to binary trees.  Although it is possible to represent any non-binary tree as a binary tree, forcing a binary representation onto natural language does not add to the value of the result.  Seginer’s process gives parse trees that include constructions of more than two nodes.

Exocentric Constructions.  A phrase like ‘the boy’ has links going in both directions between the two words.  Either word can be considered the head of the phrase, which in some ways matches the disagreement among linguists about which word is actually the head of the phrase.

Link Depths.  His algorithm’s results include a depth value on dependency links.  This value can be used to distinguish between internal and external arguments to a phrase.

Incremental Parsing.  For each word read into the sentence, links can only be added to or from the new word.

             I know the boy (sleeps)

When the algorithm encounters the word sleeps it can only add a link to or from this word.  No links that were generated at earlier steps can be affected.   This reduces the search space and contributes to the overall speed of the parser.  Another advantage is that the links in ‘I know the boy’ are an exact subset of the links in ‘I know the boy sleeps’.  There is no decision necessary about dropping links that were previously discovered.
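This prefix property can be demonstrated with a small sketch. The link rule used here (link each new word to the previous one) is a stand-in of my own, not Seginer's actual scoring; the point is only that links for a prefix of the sentence are a subset of the links for the whole sentence:

```python
# Sketch of incremental link addition: at each step, only links touching
# the newest word may be added, so earlier links are never revised.

def incremental_links(words):
    links = []
    for i in range(1, len(words)):
        # Placeholder rule: link the new word i to its predecessor.
        links.append((i - 1, i))
    return links

short = incremental_links("I know the boy".split())
full = incremental_links("I know the boy sleeps".split())
print(set(short) <= set(full))  # True: the prefix parse is preserved
```

Whatever the real link-scoring rule is, restricting additions to the newest word guarantees this subset relation by construction.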

No Clustering.  Seginer’s approach groups words with class labels that in some respect are part of speech classes.  However, the approach does not require clustering of words into specific classes.  It finds similarities between the and a but it does not require that all determiners are grouped together in a well defined class.  This makes the approach better able to deal with noise in the training data.

Homophony.  When a word has more than one meaning, it can confuse any machine algorithm.  Seginer’s algorithm deals with homophony by comparing labels between target pairs of words.  In the example that follows, words in brackets are possible labels for the target words.  When an underscore appears in the label, it means to look at the label of the preceding or following word.

             This[the] year[_the]    (this is a determiner)

             This[is_] was[is]       (this is a pronoun)

In the example given above, in one case this is a determiner.  One of its labels is [the] because this and the occur in similar places.  In the second line a label for this is [is_] which means that this is frequently followed by is.  The algorithm finds pairs of labels such as [the] and [_the] or [is_] and [is] to decide where to put links.
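The pairing of labels like [the] with [_the], or [is_] with [is], can be sketched as a simple matching function. The label sets below are illustrative, not Seginer's exact scheme:

```python
# Sketch of label matching: a trailing underscore ("is_") on the left word
# pairs with the plain label ("is") on the right word, and a leading
# underscore ("_the") on the right word pairs with the plain label ("the")
# on the left word.

def labels_match(left_labels, right_labels):
    for l in left_labels:
        if l.endswith("_") and l[:-1] in right_labels:
            return True
    for r in right_labels:
        if r.startswith("_") and r[1:] in left_labels:
            return True
    return False

print(labels_match({"the"}, {"_the"}))  # "This year": determiner-like use
print(labels_match({"is_"}, {"is"}))    # "This was": pronoun-like use
print(labels_match({"the"}, {"is"}))    # no matching pair, so no link
```

Because the match is computed per word pair, the ambiguous word 'this' can behave as a determiner in one context and a pronoun in another without the algorithm ever assigning it a single fixed class.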

I think Seginer’s approach has some real merit.  It is a greedy algorithm – it learns as it goes and does not need to revisit previous sentences.  It uses simple math.  It has results that seem to match psycholinguistic models.   It would be great to see this approach extended to take advantage of other linguistic phenomena such as morphology.

 A link to the paper can be found here.  Paper

 A link to a video presentation on the approach can be found here.    Video

Important Language Characteristic

January 10th, 2010

Today I was pondering the issue of head directionality in languages: does the head of a phrase come before or after the remaining portion of the phrase?  This parameter has been largely disregarded because most languages are inconsistent in its use.

However, it occurred to me that infants could rely on the heavy weighting of head directionality when they are first learning a language.  By choosing the direction that is predominant and ignoring the incoming utterances of the opposite direction, it would simplify the initial learning phase.
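The idea can be sketched as a tiny tally: estimate the predominant direction from observed head-dependent links and ignore the minority direction during early learning. The link data here is made up for illustration:

```python
from collections import Counter

# Hypothetical sketch: pick the predominant head direction from observed
# (head_index, dependent_index) pairs, so a learner could filter out
# links that run the other way.

def predominant_direction(links):
    counts = Counter("head-first" if h < d else "head-last" for h, d in links)
    return counts.most_common(1)[0][0]

links = [(0, 1), (0, 2), (2, 1), (1, 3)]  # made-up evidence
print(predominant_direction(links))  # head-first
```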

Perhaps someone has already proposed this little insight into child language acquisition.

As I was pondering this, it also occurred to me that the single most important characteristic of a language is that it can be learned by an infant.  If the majority of infants in a culture can’t learn the language of their parents, then that language will not persist.  It will die out.

Perhaps someone else has already proposed this little insight as well.

Videos on Unsupervised Learning of Syntax

January 1st, 2010

Here are two links to videos on Unsupervised Learning of natural language syntax.

Chris Manning gave this talk at MLCS 2007.  It is a fairly detailed discussion of the work that Chris did with Dan Klein on learning syntax structure.  Dan used the material for his dissertation.

Chris Manning Video

Dan Klein gave this talk at UAI 2008.  It is more high level, and describes how Dan and his students have applied their work to a few other areas of unsupervised learning.

Dan Klein Video

Both these videos are worth watching.

Learning is a lifelong adventure

September 2nd, 2009

I finished my MA degree at the University of Washington.  Many thanks to the faculty and staff that helped me finish.  Many thanks to my wife and family that supported me while I struggled through it.

However, this week I was reminded that there are limits to how much you can learn in a degree program that lasts only 5 quarters.  A blog post at LingPipe reviewing my blog pointed out several errors in what I had written here in one of my posts.  Well, lucky for me, I hadn’t intended the text to be submitted to a conference jury for review.  The paper was just a one-week assignment.  As I remember, I passed the class, so the paper probably met the requirements for the course even though it has very clear shortcomings in the eyes of the LingPipe reviewer.

Maybe I will find the time to dig out the code that was used for that paper, and compare it against the evaluation at LingPipe and see if I can understand what the LingPipe objections were.  I might even try to re-implement the code following the expert’s directions.

Another way I look at this is: the more you know, the more you know that you don’t know.  It has been healthy for me to receive some constructive criticism of my post.  I have been writing software for only 30 years, so maybe in the next 30 years, I can learn how to fix the problems that LingPipe pointed out.  For now, I am just an average software engineer.  Code reviews are part of the process.