Speech synthesis means creating human-like speech with a machine, known as a speech synthesizer.
There are several types of these synthesizers, but each has the same goal: to reproduce the given text in the clearest and most understandable manner.
There are four basic approaches:
The picture shows a block diagram of a general synthesizer. This diagram is simplified to our needs, and some elements (such as the feedback found in some learning synthesizers) are omitted. However, virtually every synthesizer consists of the following parts:
To achieve natural synthesized speech, the synthesizer should perform more complex tasks such as preprocessing and post-processing. Ideally, to be as human-like as possible, the system should be adaptive and able to learn. Such a system would consist of four basic modules: phonetic transcription of words, word class identification (mainly for German and Slavic languages, which use inflection), phonetic transcription of abbreviations, and a prosody modification module (e.g., different accent and speed when asking a question, giving a command, etc.).
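The four modules can be imagined as a simple text-preprocessing pipeline. A minimal sketch follows; the abbreviation table, transcription rule, and prosody tagging are placeholder assumptions for illustration, not real Slovak data or rules.

```python
# Sketch of a preprocessing pipeline: abbreviation expansion,
# (placeholder) phonetic transcription, and prosody tagging.
# All rules and data below are illustrative assumptions.

ABBREVIATIONS = {"atd.": "a tak dalej"}  # hypothetical abbreviation table

def expand_abbreviations(text):
    """Replace known abbreviations with their full wording."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())

def transcribe(text):
    """Placeholder grapheme-to-phoneme step; real rules are language-specific."""
    return [ch for ch in text if ch != " "]

def apply_prosody(phonemes, sentence_type):
    """Tag each phoneme with the sentence type (question, command, ...)."""
    return [(p, sentence_type) for p in phonemes]

text = expand_abbreviations("ahoj atd.")
phonemes = apply_prosody(transcribe(text), "statement")
print(phonemes[:3])
```

A real system would also insert the word class identification step between expansion and transcription, since inflected forms can change pronunciation.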
In the following example the focus is only on the synthesis process.
This is an example of where a speech synthesizer can be used. The major advantages of this solution are a naturally sounding voice and a small database size. The Slovak language has only 1550 diphones, which makes the size of the solution very reasonable (especially compared to other approaches).
A diphone, together with the phoneme, is one of the major speech elements. It consists of two neighboring phonemes, and its boundaries lie in the middle of these sounds. This means that the length of a diphone is not double, as one might suspect, but approximately the same as the length of one phoneme. The advantage of using diphones instead of phonemes is that they better represent the transition between sounds, because their boundaries are in the middle of the sounds, where the characteristic time curve is most stable.
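Splitting a phoneme sequence into diphones can be sketched as pairing each phoneme with its neighbor; the phoneme symbols below are illustrative, not a real SAMPA inventory.

```python
# Minimal sketch: a sequence of n phonemes yields n - 1 diphones,
# each spanning from the middle of one phoneme to the middle of the next.

def to_diphones(phonemes):
    """Return the list of diphones (ordered phoneme pairs)."""
    return [(a, b) for a, b in zip(phonemes, phonemes[1:])]

print(to_diphones(["a", "h", "o", "j"]))
# [('a', 'h'), ('h', 'o'), ('o', 'j')]
```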
In theory, the number of diphones is the square of the number of phonemes (all ordered pairs of two phonemes). However, the real number is lower, because a particular language does not use all of them. We can get the real number of diphones by closely studying the language. A diphone database consists of real speech recordings that are broken into small parts – diphones. There are two options for creating and recording this database: either choose words from a dictionary that together cover all diphones, or use some other approach. These words do not have to have a meaning; the aim is to have the smallest possible set of recordings.
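The gap between the theoretical upper bound and the diphones actually used can be illustrated with a small sketch; the phoneme inventory size and the tiny corpus below are made-up examples, not real Slovak data.

```python
# Compare the theoretical diphone count (all ordered phoneme pairs)
# with the subset actually observed in a transcribed corpus.

def theoretical_diphone_count(num_phonemes):
    """Every ordered pair of phonemes is a potential diphone."""
    return num_phonemes ** 2

def observed_diphones(transcriptions):
    """Collect the diphones that actually occur in the corpus."""
    found = set()
    for phonemes in transcriptions:
        found.update(zip(phonemes, phonemes[1:]))
    return found

corpus = [["a", "h", "o", "j"], ["h", "o", "l", "a"]]
print(theoretical_diphone_count(5))    # 25 possible pairs
print(len(observed_diphones(corpus)))  # only 5 actually occur here
```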
The design of a diphone synthesizer is shown in the picture below. It describes in a very simple form how the synthesizer works.
The input text has to be synthesized into speech, but first it has to be transcribed into SAMPA. In the first step, all characters are converted to SAMPA symbols. In the second step, the result of the first step is rewritten according to the pronunciation rules of the given language (in our case Slovak). Each diphone is then looked up in the database, and the corresponding units are selected and concatenated together. The output is the synthesized speech.
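The lookup-and-concatenate step can be sketched as follows, assuming the text has already been transcribed into phoneme symbols. The tiny diphone "database" maps each diphone to a placeholder waveform fragment; in a real system these would be recorded audio samples cut from speech.

```python
# Minimal sketch of concatenative synthesis: look up each diphone
# in the database and join the stored waveform fragments.

database = {
    ("a", "h"): [0.1, 0.2],
    ("h", "o"): [0.3, 0.1],
    ("o", "j"): [0.2, 0.0],
}

def synthesize(phonemes):
    """Concatenate the waveform fragments for each diphone in order."""
    signal = []
    for diphone in zip(phonemes, phonemes[1:]):
        if diphone not in database:
            raise KeyError(f"diphone {diphone} missing from database")
        signal.extend(database[diphone])
    return signal

print(synthesize(["a", "h", "o", "j"]))
# [0.1, 0.2, 0.3, 0.1, 0.2, 0.0]
```

A real synthesizer would additionally smooth the joins between fragments, since raw concatenation produces audible discontinuities at the diphone boundaries.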
Some examples of speech-enabled applications are: personal speech assistants, mobile assistants for the blind, city guides, timetables and navigation systems, web-based multimodal services, applications for documenting traffic accident reports, and speech-based inventory and time management services. Nowadays, book readers implementing speech synthesis are very popular, mainly for the English language.