Speech synthesis
Speech synthesis is the artificial production of human speech. A system used for this purpose is termed a speech synthesizer, and can be implemented in software or hardware. Speech synthesis systems are often called text-to-speech (TTS) systems in reference to their ability to convert text into speech. However, some systems can only render symbolic linguistic representations, such as phonetic transcriptions, into speech.
Overview of speech synthesis technology
A text-to-speech system (or engine) is composed of two parts: a front end and a back end. Broadly, the front end takes input in the form of text and outputs a symbolic linguistic representation. The back end takes the symbolic linguistic representation as input and outputs the synthesized speech waveform. Two qualities are used to judge a synthesizer: naturalness, which refers to how closely the output sounds like the speech of a real person, and intelligibility, which refers to how easily the output can be understood.
The front end has two major tasks. First, it takes the raw text and converts things like numbers and abbreviations into their written-out word equivalents. This process is often called text normalization, pre-processing, or tokenization. Then it assigns phonetic transcriptions to each word, and divides and marks the text into various prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme (TTP) or grapheme-to-phoneme (GTP) conversion. The combination of phonetic transcriptions and prosody information makes up the symbolic linguistic representation output by the front end.
The back end takes the symbolic linguistic representation and converts it into actual sound output. The back end is often referred to as the synthesizer. The different techniques synthesizers use are described below.
History
Long before modern electronic signal processing was invented, speech researchers tried to build machines to create human speech. Early examples of speaking heads were made by Gerbert of Aurillac (d. 1003), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294). In 1779, Christian Kratzenstein of St. Petersburg built models of the human vocal tract that could produce the five long vowel sounds (a, e, i, o and u). This was followed by the bellows-operated Acoustic-Mechanical Speech Machine by Wolfgang von Kempelen of Vienna, Austria, described in his 1791 paper Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine ("mechanism of human speech with description of his speaking machine", J.B. Degen, Wien). This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837 Charles Wheatstone produced a speaking machine based on von Kempelen's design, and in 1857 M. Faber built the Euphonia. Wheatstone's design was resurrected in 1923 by Paget.
In the 1930s, Bell Labs developed the VOCODER, an electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the keyboard-operated VODER, which he exhibited at the 1939 New York World's Fair. Early electronic speech synthesizers sounded very robotic and were often barely intelligible. Output from contemporary TTS systems is sometimes indistinguishable from actual human speech.
Despite the success of electronic speech synthesis, research is still being conducted into mechanical speech synthesizers for use in humanoid robots. Even a perfect electronic synthesizer is limited by the quality of the transducer (usually a loudspeaker) that produces the sound, so in a robot a mechanical system may be able to produce a more natural sound than a small loudspeaker.
The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. Since then, there have been many advances in the technologies used to synthesize speech. See the examples below for state-of-the-art commercial and free text-to-speech systems.
References:
- Dennis Klatt's History of Speech Synthesis
- History and Development of Speech Synthesis (Helsinki University of Technology)
Synthesizer technologies
There are two main technologies used for generating synthetic speech waveforms: concatenative synthesis and formant synthesis.
Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis gives the most natural-sounding synthesized speech. However, natural variation in speech and automated techniques for segmenting the waveforms sometimes result in audible glitches in the output, detracting from the naturalness. There are three main subtypes of concatenative synthesis:
- Unit selection synthesis uses large speech databases (more than one hour of recorded speech). During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. The division into segments can be done using a number of techniques, like clustering, using a specially modified speech recognizer, or by hand, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch). At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This technique gives the greatest naturalness because it does not apply digital signal processing techniques to the recorded speech; such processing often makes recorded speech sound less natural. In fact, output from the best unit selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness often requires unit selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data and numbering into the dozens of hours of recorded speech.
- Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a given language. The number of diphones depends on the phonotactics of the language: Spanish has about 800 diphones, German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. The quality of the resulting speech is generally not as good as that from unit selection but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available implementations.
- Domain-specific synthesis concatenates pre-recorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. This technology is very simple to implement, and has been in commercial use for a long time: it is the technology used by devices like talking clocks and calculators. The naturalness of these systems can potentially be very high because the variety of sentence types is limited and closely matches the prosody and intonation of the original recordings. However, because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases they have been pre-programmed with.
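The "best chain of candidate units" step in unit selection synthesis, described in the first item above, is commonly framed as a shortest-path search. The following is a minimal sketch under stated assumptions, not a description of any particular system: the Unit record, the two cost functions (distance to a desired pitch, and pitch mismatch between neighbouring units), and all names are hypothetical and chosen only for illustration. Real systems use richer acoustic features and much larger databases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str     # phone label this recorded segment realizes
    pitch: float   # mean fundamental frequency in Hz (from the database index)
    unit_id: str   # hypothetical identifier of the recording


def target_cost(unit, wanted_pitch):
    # Assumed cost: how far the unit is from the pitch we want at this position.
    return abs(unit.pitch - wanted_pitch)


def join_cost(prev, cur):
    # Assumed cost: pitch discontinuity where two units are concatenated.
    return abs(prev.pitch - cur.pitch)


def select_units(database, targets):
    """Return the cheapest chain of units for targets = [(phone, pitch), ...]."""
    if not targets:
        return []
    # Candidate units for each target position, looked up by phone label.
    candidates = [[u for u in database if u.phone == phone] for phone, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for at least one target phone")

    # best[i][j] = (cost of the cheapest chain ending in candidates[i][j], backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i][1]), back))
        best.append(row)

    # Trace the cheapest chain backwards from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]
```

Because each position only needs the cheapest chain ending in each candidate of the previous position, the search cost grows with the number of positions times the square of the candidates per position, rather than with the number of all possible chains.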
Formant synthesis
Formant synthesis does not use any human speech samples at runtime. Instead, the output synthesized speech is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rule-based synthesis, but some argue that because many concatenative systems use rule-based components for some parts of the system, like the front end, the term is not specific enough. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech, and the output would never be mistaken for the speech of a real human. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have some advantages over concatenative systems. First, formant synthesized speech can be very reliably intelligible, even at very high speeds, avoiding the acoustic glitches that can often plague concatenative systems. High-speed synthesized speech is often used by the visually impaired for quickly navigating computers using a screen reader. Second, formant synthesizers are often smaller programs than concatenative systems because they do not have a database of speech samples. They can thus be used in embedded computing situations where memory space and processor power are often scarce. Last, because formant-based systems have total control over all aspects of the output speech, a wide variety of prosody or intonation can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
Other synthesis methods
- Articulatory synthesis is a synthesis method mostly of academic interest at the moment. It is based on computational models of the human vocal tract and the articulation processes occurring there. These models are currently not sufficiently advanced or computationally efficient to be used in commercial speech synthesis systems.
- Hybrid synthesis marries aspects of formant and concatenative synthesis to minimize the acoustic glitches when speech segments are concatenated.
- HMM-based synthesis is a synthesis method based on hidden Markov models (HMMs). In this system, the speech spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion.
Front-end challenges
Text normalization challenges
The process of normalizing text is rarely straightforward. Texts are full of homographs, numbers and abbreviations that all ultimately require expansion into a phonetic representation. There are many words in English which are pronounced differently based on context. Some examples:
- project: My latest project is to learn how to better project my voice.
- bow: The girl with the bow in her hair was told to bow deeply when greeting her superiors.
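The expansion of numbers and abbreviations mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions: the ABBREVIATIONS table, the function names, and the 0 to 999 range are hypothetical choices made for the example, and the sketch does nothing about homographs such as "project" or "bow", which need context to resolve.

```python
import re

# Hypothetical abbreviation table; a real front end would need a far larger,
# context-sensitive one (for example, "St." can mean "Street" or "Saint").
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Expand known abbreviations and small integers into written-out words."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)

    def spell(match):
        n = int(match.group())
        # Numbers outside the toy range are left as digits.
        return number_to_words(n) if n < 1000 else match.group()

    return re.sub(r"\d+", spell, text)
```

For example, normalize("Dr. Smith has 42 cats.") yields "Doctor Smith has forty-two cats." Even this small sketch hides decisions a real system must make from context, such as whether "1968" is a year or a quantity.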
Text-to-phoneme challenges
Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion, as phoneme is the term used by linguists to describe distinctive sounds in a language. The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciation is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach used for text-to-phoneme conversion is the rule-based approach, where rules for the pronunciations of words are applied to words to work out their pronunciations based on their spellings. This is similar to the "sounding out" approach to learning reading. Each approach has advantages and drawbacks. The dictionary-based approach has the advantages of being quick and accurate, but it completely fails if it is given a word which is not in its dictionary, and as dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as it takes into account irregular spellings or pronunciations. As a result, nearly all speech synthesis systems use a combination of both approaches. Some languages, like Spanish, have a very regular writing system, and the prediction of the pronunciation of words based on the spelling works correctly in nearly all instances. Speech synthesis systems for languages like this often use the rule-based approach as the core approach for text-to-phoneme conversion, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciation is not obvious from the spelling.
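The combination of the two approaches can be sketched as a dictionary lookup with a rule-based fallback. Everything in this sketch is an assumption made for illustration: the tiny LEXICON, the toy LETTER_RULES table, and the informal ARPAbet-style phoneme labels are hypothetical and far cruder than what a real system would use.

```python
# Hypothetical pronunciation dictionary with informal ARPAbet-style labels.
LEXICON = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "cat": ["K", "AE", "T"],
}

# Toy letter-to-sound rules: two-letter spellings plus one sound per letter.
LETTER_RULES = {
    "sh": "SH", "ch": "CH", "th": "TH",
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def apply_rules(word):
    """Rule-based approach: 'sound out' a lowercase word letter by letter."""
    phonemes = []
    i = 0
    while i < len(word):
        # Prefer two-letter spellings such as "sh" over single letters.
        for size in (2, 1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in LETTER_RULES:
                phonemes.extend(LETTER_RULES[chunk].split())
                i += size
                break
        else:
            i += 1  # skip characters the toy rules do not cover
    return phonemes


def to_phonemes(word):
    """Dictionary-based lookup first, rule-based fallback for unknown words."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return apply_rules(word)
```

With these tables, to_phonemes("one") returns the dictionary entry W AH N, whereas apply_rules("one") alone produces AA N EH, which shows why irregular spellings push systems toward dictionaries. A word missing from the dictionary, such as "ship", still gets a usable result (SH IH P) from the rules.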
On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, often rely mostly on dictionaries and use rule-based approaches only for unusual words or names that are not in the dictionary.
Speech synthesis markup languages
A number of markup languages have been established for the rendition of text as speech in an XML-compliant format, the most recent being SSML, proposed by the W3C, which is in draft status at the time of this writing. Older speech synthesis markup languages include SABLE and JSML. Although each of these was proposed as a new standard, none of them has been widely adopted. The Cascading Style Sheets 2 specification includes a subset called Aural Cascading Style Sheets. Speech synthesis markup languages should be distinguished from dialogue markup languages such as VoiceXML, which includes, in addition to text-to-speech markup, tags related to speech recognition, dialogue management and touchtone dialing.
See also
- Speech processing
- Speech recognition
- Speech to text (dictation)
- Natural language processing
- Sonification (the use of non-speech audio to convey information)
- Software Automatic Mouth
- Apple PlainTalk
- FreeTTS
External links
Misc
- Samples of commercial TTS systems.
- Free Speech Synthesis system designed for the vocally impaired, with links to other speech related assistive technologies and resources for PALS.
- Speech Synthesis & Analysis Software
- comp.speech Frequently Asked Questions
- Free TTS Audio Books Free audio book downloads created with NeoSpeech voices
Freely available TTS systems
- Festival is a freely available complete diphone concatenation and unit selection TTS system.
- Flite (Festival-lite) is a smaller, faster alternative version of Festival designed for embedded systems and high volume servers.
- FreeTTS is written entirely in Java and based on Flite.
- MBROLA is a freely available diphone concatenation system (back end).
- Gnuspeech is an extensible, text-to-speech package, based on real-time, articulatory, speech-synthesis-by-rules.
- Epos is a rule-driven TTS system primarily designed to serve as a research tool.
- HTS is a freely available HMM-based speech synthesis system (back end).
Commercially available TTS systems
- Rhetorical rVoice
- Loquendo TTS
- ScanSoft RealSpeak
- Sakrament Text-to-Speech Engine
- Nuance Vocalizer
- AT&T Natural Voices
- Microsoft Mandarin Chinese TTS Online Demo, English Demo
- ASY is an articulatory synthesis program developed at Haskins Laboratories
- Cepstral
- IBM Research TTS (U.S. English, Arabic, Chinese, French, & German speech samples)
- ATIP's German TTS Voices