Diphones in Text To Speech

For a phonetics class in the Master’s program at the University of Washington, I wrote a research paper on how Diphones are used in text to speech systems. Essentially, Diphones are short stretches of speech that are extracted from a recording of words or sentences.

One of the main problems with Text To Speech systems is making them sound natural by varying the prosody of the output. Prosody is the term for the variation in pitch, duration and intensity that all people use when speaking an utterance. By splitting a recording into Diphones, the system can collect several recorded versions of the same Diphone and select from that list of candidates for each slot in the output. For each slot, the system picks the Diphone candidate whose prosody is closest to the desired prosody.
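As a rough illustration of that selection step, here is a minimal Python sketch. It is not the method from the paper: the `Prosody` fields, the `SCALE` values and the scaled Euclidean distance are all assumptions chosen for the example, and the candidate values are made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Prosodic features of one Diphone (field names are illustrative)."""
    pitch_hz: float
    duration_ms: float
    intensity_db: float


# Assumed scaling factors so that no single feature dominates the distance.
SCALE = Prosody(pitch_hz=50.0, duration_ms=40.0, intensity_db=6.0)


def prosody_distance(a: Prosody, b: Prosody) -> float:
    """Scaled Euclidean distance between two prosody descriptions."""
    return math.sqrt(
        ((a.pitch_hz - b.pitch_hz) / SCALE.pitch_hz) ** 2
        + ((a.duration_ms - b.duration_ms) / SCALE.duration_ms) ** 2
        + ((a.intensity_db - b.intensity_db) / SCALE.intensity_db) ** 2
    )


def select_candidate(target: Prosody, candidates: list[Prosody]) -> Prosody:
    """Return the candidate whose prosody is closest to the target."""
    if not candidates:
        raise ValueError("no candidates for this slot")
    return min(candidates, key=lambda c: prosody_distance(target, c))


# Made-up candidates for a single slot, e.g. three recordings of one Diphone.
candidates = [
    Prosody(120.0, 80.0, 60.0),
    Prosody(200.0, 90.0, 62.0),
    Prosody(125.0, 150.0, 70.0),
]
target = Prosody(190.0, 85.0, 61.0)
best = select_candidate(target, candidates)
```

With these made-up numbers, `best` is the second candidate, since its pitch, duration and intensity are all close to the target. A real system would likely also weigh how smoothly neighboring Diphones join, which this sketch ignores.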

Here is an image showing the word ‘maybe’. There are four phones, or segments: ‘m’, ‘ay’, ‘b’ and ‘e’. A Diphone is made up of two halves of two adjacent phones: the second half of one phone and the first half of the next. The middle of the phone is the most stable portion. By splitting the recording at the middle of each phone, there is less disturbance at the joins between Diphones that are concatenated in the synthesized speech output.
[Image: the word ‘maybe’]
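The cutting rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `diphone_cuts` is hypothetical, and the phone timings for ‘maybe’ are invented for the example, not measured from the recording.

```python
def diphone_cuts(phones):
    """Cut a labeled recording at the midpoint of each phone.

    phones: list of (label, start_ms, end_ms) tuples in spoken order.
    Returns a list of (diphone_name, cut_start_ms, cut_end_ms) tuples,
    where each Diphone spans from the middle of one phone to the
    middle of the next.
    """
    midpoints = [(label, (start + end) / 2) for label, start, end in phones]
    return [
        (f"{left}-{right}", left_mid, right_mid)
        for (left, left_mid), (right, right_mid) in zip(midpoints, midpoints[1:])
    ]


# Invented timings (in milliseconds) for the four phones of 'maybe'.
maybe = [
    ("m", 0, 80),
    ("ay", 80, 240),
    ("b", 240, 300),
    ("e", 300, 460),
]
cuts = diphone_cuts(maybe)
# cuts == [("m-ay", 40.0, 160.0), ("ay-b", 160.0, 270.0), ("b-e", 270.0, 380.0)]
```

Four phones give three Diphones here. The sketch leaves out the half-phones at the very start and end of the word, which a complete system would also need to handle.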
Here is a link to the paper that describes the technique of using Diphones for text to speech systems.

Text To Speech Using Diphones.pdf
