Speech synthesis, or text-to-speech (TTS), is the computer-based creation of artificial speech from ordinary written text. Unlike recorded audio playback, TTS generates speech entirely from the text itself rather than replaying a human recording.
How It Works
There are two main components of a TTS system:
The first is natural language processing (NLP), which converts raw text (including punctuation, abbreviations, numbers and symbols) into a phonetic transcription. The transcription specifies phonemes (the basic units of speech sound) as well as prosody (intonation, rhythm, rate of speech) derived from cues in the text.
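The NLP stage described above can be sketched in a few lines: normalize the raw text (expanding abbreviations and digits into full words), then look each word up in a pronunciation dictionary. The abbreviation table and phoneme entries below are illustrative toy data, not a real lexicon; real systems use large dictionaries plus letter-to-sound rules for unknown words.

```python
# Toy expansion tables -- illustrative only, not a real lexicon.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
NUMBERS = {"2": "two", "3": "three"}

# Toy pronunciation dictionary using ARPAbet-style phoneme symbols (abridged).
PHONEME_DICT = {
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "who": ["HH", "UW1"],
}

def normalize(text: str) -> list:
    """Lowercase the text and expand abbreviations and digits into words."""
    words = []
    for token in text.lower().split():
        token = ABBREVIATIONS.get(token, token)
        token = NUMBERS.get(token, token)
        words.append(token.strip(".,!?"))
    return words

def to_phonemes(words):
    """Look each word up in the dictionary; unknown words pass through as-is."""
    return [PHONEME_DICT.get(w, [w.upper()]) for w in words]

words = normalize("Dr. Who")
print(words)               # ['doctor', 'who']
print(to_phonemes(words))  # [['D', 'AA1', 'K', 'T', 'ER0'], ['HH', 'UW1']]
```

A full front end would also assign prosody (stress, phrase breaks, pitch contours) from punctuation and sentence structure, which this sketch omits.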
The second component of TTS is digital signal processing (DSP), which converts the phonetic representation into spoken words delivered through a computer or other device's audio output. DSP requires a voice font: a set of recordings in which a human speaker reads phrases designed to cover every combination of phonemes in the language. The system builds speech from this voice font by concatenating audio samples, then applies algorithms to smooth the joins between them and to adjust aspects such as volume and rate of speech.
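The concatenation-and-smoothing step can be sketched as follows. This is a minimal illustration, not a production vocoder: each phoneme maps to a short placeholder list of audio samples standing in for a voice font, and utterances are built by joining those samples with a simple linear crossfade to smooth the seams, then scaled for volume.

```python
# Placeholder waveform snippets standing in for a recorded voice font.
VOICE_FONT = {
    "HH": [0.0, 0.2, 0.4, 0.2],
    "UW": [0.5, 0.6, 0.5, 0.3],
}

def crossfade(a, b, overlap=2):
    """Blend the tail of `a` into the head of `b` over `overlap` samples."""
    if not a:
        return list(b)
    out = a[:-overlap]
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # linear fade weight from a toward b
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

def synthesize(phonemes, volume=1.0):
    """Concatenate voice-font samples for each phoneme, then apply gain."""
    samples = []
    for p in phonemes:
        samples = crossfade(samples, VOICE_FONT[p])
    return [s * volume for s in samples]

audio = synthesize(["HH", "UW"], volume=0.8)
print(len(audio))  # 6 samples: 4 + 4 minus the 2-sample overlap
```

Real concatenative systems work with much larger units (diphones or variable-length segments), select units to minimize audible joins, and apply signal-processing techniques far more sophisticated than a linear crossfade, but the overall pipeline has the same shape.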
Perhaps the first true speech synthesizer was the Voder developed by Bell Labs in the 1930s. Operating it with a keyboard, developer Homer Dudley demonstrated the Voder at the 1939 World’s Fair in New York.
Speech synthesis entered the mainstream in the 1970s in telecommunications and consumer electronics (e.g., Texas Instruments' Speak & Spell) and in the 1980s in video games, among other areas. While early systems produced robotic-sounding speech, the technology has gradually improved over the years.
Speech synthesis is an integral piece of modern telecommunications, particularly in the interactive voice response (IVR) systems used widely by companies and call centers. Other applications include consumer electronics, video games, language education, assistive technology for people with disabilities (Stephen Hawking's speech system, most notably), human-computer interaction and research.