Diphones in Text To Speech

January 5th, 2008

For a phonetics class that is part of the Master’s program at the University of Washington, I wrote a research paper on how Diphones are used in text to speech systems. Essentially, Diphones are portions of words that are extracted from a recording of words or sentences.

One of the main problems with Text To Speech systems is making them sound natural by varying the prosody of the output. Prosody is the term for the variation in pitch, duration and intensity that all people use when speaking an utterance. By splitting a recording into Diphones, the system can select from a list of candidates for each slot in the output. The system finds the Diphone candidate that is closest to the desired prosody.
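
As a rough illustration of that selection step, here is a minimal Python sketch. It is not the method from the paper: the `Candidate` fields, the example values, and the equally weighted squared relative differences used as a distance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One recorded instance of a diphone (hypothetical fields and units)."""
    name: str
    pitch_hz: float
    duration_ms: float
    intensity_db: float


FEATURES = ("pitch_hz", "duration_ms", "intensity_db")


def prosody_distance(cand, target):
    # Assumed cost: sum of squared relative differences, one term per feature.
    return sum(
        ((getattr(cand, f) - getattr(target, f)) / getattr(target, f)) ** 2
        for f in FEATURES
    )


def select_closest(candidates, target):
    # Pick the candidate whose prosody is nearest to the desired prosody.
    return min(candidates, key=lambda c: prosody_distance(c, target))


# Desired prosody for one slot in the output (made-up numbers).
target = Candidate("m-ay", 120.0, 90.0, 65.0)
candidates = [
    Candidate("m-ay #1", 180.0, 70.0, 60.0),
    Candidate("m-ay #2", 125.0, 95.0, 66.0),
    Candidate("m-ay #3", 100.0, 140.0, 70.0),
]
best = select_closest(candidates, target)
print(best.name)
```

A real system would probably also weigh how well neighboring candidates join together, which this sketch ignores.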

Here is an image showing the word ‘maybe’. There are four phones or segments ‘m’ ‘ay’ ‘b’ ‘e’. A Diphone runs from the middle of one phone to the middle of the next, so it covers the second half of one phone and the first half of its neighbor. The middle of the phone is the most stable portion. By splitting the recording at the middle of each phone there is less disturbance at the joints between Diphones that are concatenated in the simulated speech output.
Maybe
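
The cut points for ‘maybe’ can be sketched in a few lines of Python. The phone timings below are invented for the example (they are not measured from the image), and the Diphones that join silence to the first and last phone are left out to keep it short.

```python
def diphone_spans(phones):
    """phones: ordered list of (label, start_sec, end_sec).

    Returns a list of (diphone_label, cut_start, cut_end), where every cut
    falls at the midpoint of a phone.
    """
    mids = [(start + end) / 2 for _, start, end in phones]
    spans = []
    for i in range(len(phones) - 1):
        label = phones[i][0] + "-" + phones[i + 1][0]
        spans.append((label, mids[i], mids[i + 1]))
    return spans


# Hypothetical timings for the four phones of 'maybe'.
phones = [
    ("m", 0.00, 0.08),
    ("ay", 0.08, 0.24),
    ("b", 0.24, 0.30),
    ("e", 0.30, 0.46),
]
spans = diphone_spans(phones)
for label, start, end in spans:
    print(f"{label}: {start:.2f}s to {end:.2f}s")
```

Four phones give three Diphones here, each one starting and ending in the stable middle of a phone.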
Here is a link to the paper that describes the technique of using Diphones for text to speech systems.

Text To Speech Using Diphones.pdf

Corpus size for ngram training

November 8th, 2007

As part of my graduate courses at the University of Washington, we are studying ngram based language models.  This means learning which groups of words occur together.  If N is three, then the model is built from all the groups of 3 consecutive words found in a corpus.
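
To make that concrete, here is a small Python sketch that collects and counts the groups of 3 words in a toy corpus. The corpus sentence and the function name `ngrams` are made up for the example; a real language model would go on to turn counts like these into probabilities.

```python
from collections import Counter


def ngrams(words, n):
    # Slide a window of n words across the list, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Toy corpus, invented for the example.
corpus = "the dog ran and the dog ran home".split()
trigrams = ngrams(corpus, 3)
counts = Counter(trigrams)

for gram, count in counts.most_common():
    print(" ".join(gram), count)
```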

Yesterday in class we were discussing the size of corpus required to train an ngram language model.  Our professor said that for a tri-gram model, perhaps a billion words would be sufficient, but for larger ngram sizes as much as a trillion words would be required.

Philosophically, I think this points out the limitations of n-gram training.  A human has a total corpus that is much smaller than this.

15 yr * 365 d/yr * 16 hr/d * 3600 sec/hr * 1 word/sec = ~315 million words

This number gets smaller if you realize that humans don’t have constant input for 16 hours a day.  This number gets larger if you think that adults need more than 15 years to be fully functional in today’s world of specialized knowledge.
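
Here is the same back-of-the-envelope estimate as a few lines of Python, so the assumptions can be varied. The function name and the second scenario (25 years at 8 hours a day) are my own illustration, not figures from class.

```python
def words_heard(years, hours_per_day=16, words_per_sec=1.0):
    # years * days per year * hours per day * seconds per hour * words per second
    return years * 365 * hours_per_day * 3600 * words_per_sec


baseline = words_heard(15)                   # the estimate above
longer = words_heard(25, hours_per_day=8)    # more years, fewer hours of input

for label, total in (("15 yr, 16 hr/d", baseline), ("25 yr, 8 hr/d", longer)):
    print(f"{label}: {total:,.0f} words ({total / 1e9:.2f} billion)")
```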

But nevertheless, this is less than a billion words and certainly less than a trillion words.  Having the ability to process input in a way that detects linguistic structure gives us humans an advantage over systems that can’t pick out structure.

Dolphin Speak

October 10th, 2007

Here is another Gary Larson perspective on trying to understand another species.

Of course, a human language is composed of constituents (phrases) that can be combined in many different ways.  If these dolphins were capable of human type communication, then these ‘scientists’ would at least be looking for pieces of sentences and recombinations of those pieces in novel ways instead of just repetition of the top level sentences.

But still, it is funny.

DolphinSpeak

My apologies to the copyright holder of this image.

Bender’s Axiom

October 6th, 2007

I am taking Ling566, Introduction to Syntax for Computational Linguistics, at the University of Washington as part of a Master’s program in Computational Linguistics.  The course is taught by Dr. Emily Bender who is also the director of the program.

This week Emily was introducing how feature structures are used to create a grammar description for English.  A big part of the grammar is the syntax portion, how words are formed into phrases and phrases are joined into sentences.  Feature structures are a way of adding detail to a grammar so that things like agreement can be accounted for.

As part of her lecture she said, “There is no magic in syntax.”

What she means by this is that when specifying the grammar using feature structures, all of the details have to be specified.  If something is left out of the definition, then the grammar will not work correctly.
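
A toy Python sketch shows the idea. It uses flat dictionaries as stand-ins for feature structures, which is far simpler than the typed feature structures used in the course, and the feature names PER and NUM are just my choice for the example.

```python
def unify(a, b):
    """Unify two flat feature structures (plain dicts).

    Returns the merged structure, or None if any feature has conflicting values.
    """
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None  # conflict, e.g. NUM sg versus NUM pl
        result[feature] = value
    return result


dog = {"PER": "3rd", "NUM": "sg"}
dogs = {"PER": "3rd", "NUM": "pl"}

# What the verb 'barks' demands of its subject, fully specified.
barks = {"PER": "3rd", "NUM": "sg"}
# The same entry with the NUM detail left out.
barks_underspecified = {"PER": "3rd"}

print(unify(dog, barks))                   # agreement succeeds
print(unify(dogs, barks))                  # 'the dogs barks' is rejected
print(unify(dogs, barks_underspecified))   # left-out detail: wrongly accepted
```

The last call is the point of the axiom: nothing rejects ‘the dogs barks’ unless the number constraint is actually written down.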

A similar statement that I am fond of repeating is “It does exactly what you tell it to.”  What is meant by this is that the computer is a machine that executes the instructions given to it – it executes them faithfully.  When a program runs correctly and performs the desired actions without any negative side effects, this is because the program was written that way.  And just the same, when a program crashes and you lose your data, this is because the instructions in the program have been arranged in a way that makes it crash.

At any rate, I am thoroughly enjoying taking classes at UW.  It is a thrill to be spending all of my time focused on CL.

Steven Pinker in Person

September 28th, 2007

We went to see Steven Pinker at the Seattle Town Hall.  He is promoting his new book - The Stuff of Thought.  I have read several of his books, so I was pleased to have a chance to hear him speak.

One of the focus points of his talk was how English uses prepositions to designate space and time.  For example, he asked why we say something is under water when the object is truly surrounded by water, and why we say after dark when we really mean a time period surrounded by darkness.  His proposition is that the mind simplifies its perspective when possible (Occam’s Razor?).  The surface of the water becomes a 2-D boundary that an object can be above or under.  Similarly, the boundary of nighttime (darkness) becomes a point in time after which we say ‘after dark’.  As further illustration of the dimensional reduction, he pointed out that we don’t say, “an ant walks along a plate”, because the preposition along requires a one dimensional object, and that we do say, “an ant walks along the edge of a plate” because in one sense, the edge can be thought of as a one dimensional object.

The most entertaining portion of his talk was about how swearing is used.  I suspect the reason it was so funny was the contrast between his clinical descriptions (formal register) of swearing and the familiar register that is used when someone is swearing.  He gave the example of someone accepting an award for popular music saying “this is really f***ing brilliant” and saying how in this case “f***ing” is used as an adverb.

Another example he gave that was astonishing was the case of the World Trade Center disaster.  Apparently, the insurance contract has a phrase of “3.5 billion dollars per event”.  The court case was held up on interpreting whether the 9/11 incident was one event, as in one master plan of destruction was executed, or if it was two events, as in two airplanes were used to destroy two buildings.  The effect of this distinction was whether the insurance should pay $3.5 billion or $7 billion.  Quite a substantial difference that is based on the judgement of a linguist.

Overall, Dr. Pinker’s presentation was very entertaining and enlightening.  If you have a chance to hear him speak, I recommend that you go.

Gary Larson As Linguist

September 16th, 2007

This year I have been enjoying a Gary Larson daily calendar.  It goes without saying that Gary has a unique insight into the reality of our lives.  Many of his cartoons use issues that are illuminated by a linguistic viewpoint.

For example, in this frame, the dog has written a threat letter to the cats, but the dog only uses one word.

Our dog certainly has a wider vocabulary than one kind of bark, but for each situation such as barking at a stranger, he only uses one ‘word’.  However, he does vary his barks.  Some barks are louder and there is variation in pitch.  His series of barks could be interpreted as having prosody (variation in pitch and emphasis).  Of course, we as humans can’t tell if there is any information that can be interpreted from the variation, or if it just means that he is not capable of generating a series of barks that are identical.

DogThreatLetters.JPG

Here is another frame relating to dogs.  Dogs certainly understand many human words – their name, ‘out’, ‘walk’, ‘sit’, ‘go lay down’, etc.  But dogs don’t make a relationship between words when uttered in a series.  My interpretation is that they hear one or two words in a context and use that as the entire meaning of the situation.  Our dog is very tuned into ‘walk’.

WhatDogsHear

This frame is about meeting aliens and trying to communicate through translation of language.  The assumption is that if we do ever meet an alien, the same technique for language translation we use between human languages will also work with aliens.  This will certainly be the place to start, but what if the alien brain language structures are different than ours?  In other words, Chomsky has helped us see that all human languages are based on similar structures, but if we do meet aliens, we won’t necessarily be able to rely on the existence of that similarity.

TakeMeToYourStove

This frame shows how a misinterpretation of a foreign word can be used as a joke.  Of course, Webster’s gives us the definition for Kemosabe as “faithful friend”.

Kemosabe

This frame shows a common play on words.  Take a phrase or frequent saying and replace one or more of the words.  Also in this case, he is using a homophone (same sound different spelling) for ‘ate’ versus ‘8’.

I_8_NY.JPG

This frame makes fun of our basic drives for attracting mates.  The truth is that many of our instincts come from our more simple ancestors.  The only real difference between us and lower animals is that we are self-aware and are able to modify our behavior in much more complex ways.

AnimalsAndTheirMatingSongs.JPG

My apologies to the copyright holder of these images.

Unemployment

September 15th, 2007

I have left my job so that I can enter a graduate program in Comp Ling.  As I was saying goodbye to someone, I said, “I haven’t been unemployed since I was in high school.”

His reply was, “Yeah, and you felt good in high school, didn’t ya!”

Needless to say, I am a bit excited about this opportunity that I have to start in the program at the University of Washington.

Machine Learning

May 14th, 2007

As part of a Machine Learning class, my lab partner and I wrote this project description that describes Neural Networks.  We used it on a few different data sets – one of which was a handwritten zip code character recognition task.

Typical Neural Network

Paper that describes neural networks

Yoda Speak

April 9th, 2007

This week I have been improving my phrase definitions for parsing sentences.  I added another rule that accepts modifiers before the phrase head, so for example “big dog” has an adjective “big” that modifies “dog”.  This is in contrast to “dog with a big tail” where “with a big tail” comes after “dog”.  In both cases they are considered modifiers.

Today I was working on “the dog with a stick ran”.  Here “with a stick” is a modifier to “dog”.  But it can also be considered a modifier to “ran”.  This is one of the things that Yoda does to sentence structures: the modifiers are moved around in the sentence.

Here is a tree showing “with a stick” as a modifier to “dog”.

dog with a stick

Here is a tree of the same sentence showing “with a stick” as a modifier to “ran”.

with a stick ran
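
For anyone who wants to reproduce the two readings, here is a small chart-parser sketch in Python. It is not my actual phrase definitions: the toy grammar, the lexicon, the shortcut of tagging “ran” directly as a VP, and the bracketed output format are all assumptions made to keep the example short.

```python
from collections import defaultdict

# Toy lexicon and binary rules (parent, left child, right child).
LEXICON = {
    "the": ["Det"], "a": ["Det"],
    "dog": ["N"], "stick": ["N"],
    "with": ["P"],
    "ran": ["VP"],  # shortcut: the bare verb is treated as a whole VP
}
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # modifier after the noun phrase
    ("PP", "P", "NP"),
    ("VP", "PP", "VP"),   # modifier before the verb phrase
]


def parse(words):
    """Return every S tree that covers all the words (CKY-style chart)."""
    n = len(words)
    chart = defaultdict(list)  # (start, end, label) -> list of trees
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[(i, i + 1, tag)].append((tag, word))
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in RULES:
                    for lt in chart.get((start, mid, left), []):
                        for rt in chart.get((mid, end, right), []):
                            chart[(start, end, parent)].append((parent, lt, rt))
    return chart.get((0, n, "S"), [])


def bracket(tree):
    # Leaves are (tag, word); print just the word to keep the output compact.
    if isinstance(tree[1], str):
        return tree[1]
    return "[" + tree[0] + " " + " ".join(bracket(t) for t in tree[1:]) + "]"


trees = parse("the dog with a stick ran".split())
for tree in trees:
    print(bracket(tree))
```

With this grammar the parser finds two trees, one attaching “with a stick” to “dog” and one attaching it to “ran”, which matches the two pictures above.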

I’ve always known that Linguistics is Sexy

April 9th, 2007

In a New York Times article on sexual desire titled “Birds Do It. Bees Do It. People Seek the Keys to It.”, the author asks an assortment of men and women, “What is sexual desire, and how do you know you’re feeling it?”

“Listening to Noam Chomsky,” said a psychologist in her 50s, “always turns me on.”

The article was written by Natalie Angier and published on April 10, 2007.