Multimedia processing
Speech recognition

Natural speech is produced by the human vocal organs and can be observed as vibrations of air. Among many other kinds of information it carries the lexical information that is vital for speech recognition. This information is coded into the signal as a sequence of different acoustic sounds. Each language is built on a set of basic sounds called phonemes that are used to construct its whole vocabulary. Their number varies from language to language and usually ranges from approximately 40 to 60. Unfortunately, adjacent phonemes influence each other acoustically. Moreover, they differ from speaker to speaker (they contain speaker-specific information). Further, there is background noise and distortion introduced by the environment and the recording device. Finally, the vocabularies of developed languages may contain several hundreds of thousands or even millions of words, including all cases, tenses, genders, names, etc. All this indicates that the task is extremely difficult, complex and computationally demanding.

Many speech recognition systems have therefore been designed, and they are categorized by distinctive traits as: small, medium and large vocabulary systems; speaker-dependent and speaker-independent systems; systems by speech modelling unit (phonemes, syllables, words, phrases, etc.); systems for isolated words or for continuous speech recognition; etc. Systems working with over a hundred thousand words in real-time scenarios have been reported.

For a couple of decades a great effort has been spent on building and employing automatic speech recognition (ASR) systems in areas like information retrieval and dialog systems, but only as the technology evolved further have other applications like dictation systems or even automatic transcription of natural speech emerged. For these advanced systems to be widely employed, they should be capable of operating in real time, must be speaker independent, reach high accuracy and support dictionaries containing several hundreds of thousands of words.

Speech Feature Extraction Methods

One of the first steps in the design of an ASR system is to decide which feature extraction technique to use. It should be noted at the outset that this task is not yet completely solved and a lot of effort is still being spent in this area. The aim is to simulate the human auditory system, describe it mathematically, simplify it for practical handling and optionally adapt it for correct and simple use with the selected recognition and classification methods.

As already said, speech is a very complex signal produced by the human vocal organs. Such a signal contains many kinds of information, e.g. lexical (what is said), speaker identity (who speaks), the speaker's current mood, origin (dialect), social status, health condition, speech disorders, and many more. All this information is indirectly encoded into the final speech signal during the speech production process by the brain and vocal organs.

The information stream of a speech signal is about 100 kb/s, which can be approximately estimated from its frequency band and the number of required quantization levels. However, when only the lexical information is considered, it was estimated from the number of phones per second to be approximately 10 b/s.

This shows a large redundancy in terms of lexical information. Therefore the extraction method must significantly reduce the information bit rate while keeping only the lexical part. However, this task is very complicated, as the process of coding all parts of information into a single speech signal is rather complex and not fully reversible.
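The size of this redundancy can be illustrated with a short back-of-the-envelope computation; the sampling frequency and bit depth below are assumed values typical for telephone-band speech and are not taken from the text, while the lexical rate is the figure quoted above.

```python
# Rough illustration of the redundancy argument (assumed: 8 kHz sampling, 13 bits).
fs = 8000                      # sampling frequency [Hz] (assumption)
bits_per_sample = 13           # quantization depth (assumption)

raw_rate = fs * bits_per_sample          # ~104,000 b/s, i.e. about 100 kb/s
lexical_rate = 10                        # b/s, as quoted above
redundancy = raw_rate / lexical_rate     # ~10,000x more bits than lexical content

print(f"raw: {raw_rate} b/s, lexical: {lexical_rate} b/s, redundancy: ~{redundancy:.0f}x")
```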

Naturally, there are many speech feature extraction methods that either mimic the speech production process (linear model of speech production) or simulate the human auditory system (critical bands, EIH). The idea is quite straightforward: the human auditory system has evolved over several hundred thousand years to extract “only” the relevant information out of the general audio signal while suppressing different sorts of noise and distortion.

The required basic characteristics of extracted features are: a massive bit-rate reduction; features must be “deaf” to sounds and changes that are difficult for humans to perceive and, on the contrary, sensitive to variations that are perceivable. Based on extensive research it was found that good indicators of differences in perception are the so-called formant frequencies. The following picture shows the frequency spectrum of the vowel “e”, its magnitude envelope (related to the setting of the vocal organs during speech production) and the depicted formant frequencies. The time course of the analyzed sound is shown in the next picture.

Spectra, formant frequencies and a spectral envelope for a vowel “e”.
A time course of a vowel “e”.

To show how formant frequencies are related to the classification of vowels, the following table lists the first three formant frequencies of all vowels, separately for males and females. The positive shift of formants for females is also obvious from the table.

Averaged locations of formant frequencies for males and females

vowel |        Males             |       Females
      | F1 [Hz]  F2 [Hz]  F3 [Hz] | F1 [Hz]  F2 [Hz]  F3 [Hz]
  a   |   730     1100     2450   |   850     1200     2800
  e   |   530     1850     2500   |   600     2350     3000
  i   |   400     2000     2550   |   430     2500     3100
  o   |   570      850     2400   |   590      900     2700
  u   |   440     1000     2250   |   470     1150     2700
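As a rough illustration of how such formant frequencies can be measured in practice, the sketch below estimates them from a short vowel segment via the roots of an LPC polynomial. It assumes librosa is available; the file name, frame position and LPC order are illustrative and not taken from the text.

```python
# A minimal sketch: formant candidates as the pole frequencies of an LPC model
# fitted to one windowed frame of a vowel recording (all settings assumed).
import numpy as np
import librosa

x, fs = librosa.load("vowel_e.wav", sr=8000)          # hypothetical recording of "e"
x = np.append(x[0], x[1:] - 0.97 * x[:-1])            # pre-emphasis
n = int(0.03 * fs)                                    # one 30 ms frame
frame = x[:n] * np.hamming(n)

order = int(2 + fs / 1000)                            # common rule of thumb for LPC order
a = librosa.lpc(frame, order=order)                   # LPC polynomial coefficients
roots = [r for r in np.roots(a) if np.imag(r) > 0]    # keep poles in the upper half-plane
freqs = sorted(f for f in np.angle(roots) * fs / (2 * np.pi) if f > 90)
print("formant candidates:", [f"{f:.0f} Hz" for f in freqs[:3]])   # compare with the table
```

In practice, poles with very wide bandwidths would also be discarded; this is omitted here for brevity.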

As was already emphasized, a good feature should be sensitive to differences in sounds that humans perceive as different and should be “deaf” to those which are ignored by our auditory system. It was found that the following differences are audible:

On the other hand, the following aspects do not play a role in perceiving acoustic differences:

Furthermore, proper features should be insensitive to additive and convolutional noise, or at least represent it in such a way that these distortions are easy to locate and suppress in the feature space. Next, features must be mathematically tractable, compact and easy to implement in real time. Finally, when continuous density hidden Markov models are used, feasibility requires the elements of feature vectors to be linearly independent so that a single diagonal covariance matrix can be used.

There exist many basic speech features, but currently MFCC and PLP are the most preferred ones for speech recognition and speaker identification problems. Both MFCC and PLP were designed to simulate the human auditory system, aiming to achieve better scores in speech recognition tasks.

Therefore, both PLP and MFCC are designed to capture the positions and widths of the formants that are most perceivable. Further, they have an easy interpretation and a compact representation (the Euclidean distance has a reasonable acoustic meaning). Both features are kinds of modified cepstral coefficients, but they differ in the calculation process. In the following, a brief overview of MFCC and PLP is given.

MFCC calculation process

MFCC uses a pre-emphasis filter (high pass) to suppress the low-pass character of the lip radiation. Prior to the DFT computation a Hamming window is applied, and the spectrum is warped onto the mel scale to emulate critical bands. Then equally spaced triangular windows with 50% overlap are used to simulate a filter bank. Finally, the logarithm and the DCT transform are applied. The logarithm not only produces the cepstrum but also simulates how intensities are non-linearly perceived by humans. The role of the DCT is mainly to decorrelate the elements of a vector. The MFCC calculation process is shown in the next figure.

MFCC calculation process.
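A compact sketch of the pipeline just described may make the individual steps concrete. It assumes numpy, scipy and librosa (for the mel filter bank); the number of filters and kept coefficients are common but assumed values.

```python
# MFCC of a single frame: pre-emphasis -> Hamming window -> power spectrum ->
# mel (triangular) filter bank -> logarithm -> DCT (decorrelation).
import numpy as np
import librosa
from scipy.fft import dct

def mfcc_frame(frame, fs, n_mels=26, n_ceps=13):
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis (high pass)
    frame = frame * np.hamming(len(frame))                       # Hamming window
    power = np.abs(np.fft.rfft(frame)) ** 2                      # power spectrum
    mel_fb = librosa.filters.mel(sr=fs, n_fft=len(frame), n_mels=n_mels)
    log_energies = np.log(mel_fb @ power + 1e-10)                # filter bank + logarithm
    return dct(log_energies, type=2, norm="ortho")[:n_ceps]      # DCT -> MFCC vector
```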

PLP Calculation

Unlike MFCC, the PLP calculation involves the following steps: Hamming windowing and FFT calculation, frequency warping onto the Bark scale, smoothing of the Bark-scaled spectrum by a window simulating critical bands, sampling of the smoothed Bark spectrum in approximately 1-Bark intervals to simulate the filter bank, equal-loudness weighting (approximating the hearing sensitivity), transformation of energies into loudness (raising magnitudes to the power of 0.33), calculation of linear prediction coefficients from the warped and modified spectrum, and finally derivation of cepstral LPC coefficients from the LPC. As can be seen, PLP provides more complex human-like auditory processing than MFCC; however, the two usually give comparable results in speech recognition tasks under laboratory conditions. On the other hand, PLP usually shows greater robustness in adverse conditions.
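Two of the PLP-specific steps named above, the Bark-scale warping and the 0.33 power-law compression of intensities, can be sketched as follows; the remaining steps (critical-band smoothing, equal-loudness weighting, LPC and the cepstral conversion) are left out for brevity.

```python
# PLP ingredients (sketch): Hz -> Bark warping and intensity-to-loudness compression.
import numpy as np

def hz_to_bark(f_hz):
    # analytic Bark approximation commonly used in PLP-style front ends
    return 6.0 * np.arcsinh(np.asarray(f_hz, dtype=float) / 600.0)

def intensity_to_loudness(power_spectrum):
    # power law of hearing: perceived loudness ~ intensity ** 0.33
    return np.power(power_spectrum, 0.33)

print(hz_to_bark([100, 500, 1000, 2000, 4000]))   # Bark positions of a few frequencies
```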

Auxiliary features – time dynamics and energy

As speech evolves in time, it is also important to capture these transitions, which may carry additional speech-relevant information. Thus delta and acceleration (double delta) coefficients, constructed over the acoustic features in time, are often derived as well. These can be calculated as simple differences between two consecutive frames or, more generally, as a linear combination of differences over a wider time span (−L, L), which makes the estimation more robust.

\Delta c_t = \frac{\sum_{l=1}^{L} l\,(c_{t+l} - c_{t-l})}{2\sum_{l=1}^{L} l^{2}} \quad (087) and \Delta\Delta c_t = \frac{\sum_{l=1}^{L} l\,(\Delta c_{t+l} - \Delta c_{t-l})}{2\sum_{l=1}^{L} l^{2}} \quad (088)
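A possible implementation of these regression-style dynamic coefficients is sketched below, assuming the features are stored as a (T, D) numpy array; the window half-length L = 2 is a common but assumed choice.

```python
# Delta coefficients over a symmetric window (-L, L); acceleration = delta of delta.
import numpy as np

def deltas(features, L=2):
    """features: (T, D) array of frame-wise feature vectors."""
    T = len(features)
    padded = np.pad(features, ((L, L), (0, 0)), mode="edge")     # repeat edge frames
    denom = 2 * sum(l * l for l in range(1, L + 1))
    return np.array([
        sum(l * (padded[t + L + l] - padded[t + L - l]) for l in range(1, L + 1)) / denom
        for t in range(T)
    ])

# usage: delta = deltas(mfcc_matrix); acceleration = deltas(delta)
```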

In addition to these features, energy information is sometimes extracted too. Obviously the segment’s energy itself does not provide much acoustic information for classifying the segment. However, its evolution in time follows the locations of vowels, fricatives or pauses, which may be quite discriminative. That is why the energy dynamics are usually used as well.
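The energy trajectory itself is trivial to compute; a minimal sketch, assuming the windowed frames are the rows of a (T, N) array, is given below, and its deltas then capture the energy dynamics mentioned above.

```python
# Log-energy per frame of an utterance.
import numpy as np

def log_energy(frames, floor=1e-10):
    """frames: (T, N) array, one windowed frame per row."""
    return np.log(np.sum(frames.astype(float) ** 2, axis=1) + floor)
```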

Unfortunately, there is as yet no feature that would ideally satisfy all the requirements mentioned in this survey. Thus new methods are emerging and some of them achieve even better performance; however, this is usually observed mainly in particular sorts of environments and systems, so they may not be generally applicable.

Recognition Techniques

After the speech feature extraction process each speech signal is represented as a sequence of speech vectors. It is therefore necessary to compare or assign each unknown sequence to a reference (a model or a speech sample whose lexical content is known, i.e. one that was present in the training phase) that can represent a word, a phrase or a whole sentence. Recognition systems are distinguished by the samples or models they use, i.e. sub-word, word or phrase units. If phrase or word models/samples are used, they can embrace the so-called coarticulation effect (phonemes are influenced by adjacent phonemes and by their positions in a word or even in a sentence), which plays a role. Sub-word models, on the other hand, are rather limited in their number and thus more practical, as each unit must exist in several different realizations in the training database prior to the recognition process. Next, systems differ according to whether they recognize isolated words or continuous speech, which is a substantially more difficult problem because the word boundaries are unknown. The speech recognition task is specific in that the sequences differ in their lengths. Furthermore, the variations (shortenings or prolongations) are nonlinear within words, as some parts have almost constant lengths while other phonemes are subject to great variability. Thus simple linear length normalization techniques like decimation (shortening) or interpolation (prolongation) are not very effective.

Summing up all the requirements and adverse effects involved in speech recognition, it is clear that several methods have evolved for the classification of speech sequences. The most successful ones are DTW and HMM, so both of them will be briefly introduced in the following.

DTW

DTW (dynamic time warping) is a method that acoustically compares the speech feature sequences of two utterances, a reference and an unknown one. Its main advantage is a nonlinear warping of the time scale during the comparison process. This is necessary to eliminate the differences in the sequences’ lengths. Furthermore, it compensates for nonlinear length variations within words.

In its basic form the method requires the word boundaries of both the reference and test sequences to be known. This involves the use of an additional speech detection algorithm, which is itself a rather difficult task, especially in noisy conditions.

The method is based on dynamic programming and nonlinearly warps the time indexes of both the reference and test sequences. This nonlinear and constrained mapping of a test sequence onto a reference one follows the principle of the least acoustic error between them. However, the mapping must respect certain logical limitations: the beginning and end points of both sequences must be mapped onto each other, the functions warping the time of the reference and test sequences must be non-decreasing, and there is a natural (maximal allowed) threshold on the difference between the real time indexes of mapped vectors. The DTW calculation uses two matrices, a local and a global one. The local matrix simply contains the acoustic distances between all pairs of reference and test feature vectors. The global one stores the cumulative minimal distance (and path) between the reference and test sequences calculated from the beginning up to a certain position in the matrix. However, there are natural limitations on local directions, i.e. on how it is possible to move from one point to the subsequent ones. An example of a global matrix with the limitations on a global path and the optimal path depicted is shown in the following picture.

A global distance matrix with the optimal path and limitations on a global path.

The following picture shows the most common direction limitations and weightings of local paths.

Two common local path limitations and their weightings.
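A minimal DTW sketch following the description above is given below: the local matrix holds the acoustic distances between all vector pairs and the global matrix accumulates the minimal path cost, here with simple horizontal/vertical/diagonal local transitions and no path weighting.

```python
import numpy as np

def dtw_distance(ref, test):
    """ref: (N, D) and test: (M, D) sequences of feature vectors."""
    N, M = len(ref), len(test)
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)   # local distance matrix
    G = np.full((N, M), np.inf)                                          # global distance matrix
    G[0, 0] = local[0, 0]                                                # endpoints mapped on each other
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            best_prev = min(G[i - 1, j] if i > 0 else np.inf,                # vertical step
                            G[i, j - 1] if j > 0 else np.inf,                # horizontal step
                            G[i - 1, j - 1] if i > 0 and j > 0 else np.inf)  # diagonal step
            G[i, j] = local[i, j] + best_prev
    return G[-1, -1] / (N + M)          # length-normalized cumulative distance

# Recognition: evaluate the unknown utterance against every reference template
# and pick the template with the smallest DTW distance.
```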

This method held a significant position in the area of speech recognition at the beginning of the research period, especially for isolated word recognition. However, as the requirements grew, especially speaker independence and continuous speech recognition, it lost its position to the HMM method, which solves both problems in a mathematically elegant way.

Hidden Markov Models (HMM)

The hidden Markov model method is based on statistical modeling of speech rather than on a direct comparison of reference and unknown signals. Speech is decomposed into relevant units (from the language and signal processing points of view), like phonemes, syllables, words, phrases, etc. As all statistical models are trained from multiple examples, speaker independence is relatively easy to achieve given a large database. Further, the construction of HMM models allows a simple concatenation of basic units (models) to form whole phrases, sentences or even continuous speech. The method is based on the first-order Markov chain concept used for modeling processes over static states. It is then mathematically easy to calculate the probability of a time sequence of discrete states. Thus the method is very effective even though it violates the nature of real speech production in time.

A Markov chain is defined by a set of states S1, …, SN, a transition matrix P of size N×N that gives the transition probabilities between states, and a vector of initial probabilities π for each state. The matrix P and the vector π are given as follows:

P = \{p_{ij}\}_{N \times N}, \quad p_{ij} = P(s_{t+1} = S_j \mid s_t = S_i), \qquad \pi = \{\pi_i\}_{i=1}^{N}, \quad \pi_i = P(s_1 = S_i) \quad (089)

Given such a model, the probability of occurrence of a state sequence s1, s2, s3, …, sT is given by:

P(s_1, s_2, \ldots, s_T) = \pi_{s_1} \prod_{t=2}^{T} p_{s_{t-1}\,s_t} \quad (090)
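A tiny numerical illustration of formulas (089) and (090) follows; the two-state transition matrix, initial vector and example state sequence are made up for illustration only.

```python
import numpy as np

pi = np.array([0.6, 0.4])              # initial state probabilities
P = np.array([[0.7, 0.3],              # P[i, j] = probability of moving from S_i to S_j
              [0.2, 0.8]])

def sequence_probability(states):
    prob = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        prob *= P[prev, cur]
    return prob

print(sequence_probability([0, 0, 1, 1]))   # 0.6 * 0.7 * 0.3 * 0.8 = 0.1008
```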

However, this model alone cannot describe dynamic processes such as speech signals. No matter what speech event a single state is related to, speech events have many (theoretically infinitely many) different signal realizations. If not all, at least the majority of them must be described by a single state. This can be achieved by adding another stochastic process that provides the probability of an observation (a feature vector extracted from the speech) in a given state, i.e. P(x|Si). Thus each state has an additional probability process attached to it. However, this process is not related to the time progression but to the variability of the feature vectors (observations) in a state. It is vital because there are many (theoretically an infinite number of) signal realizations of each phoneme, etc. Combining these two stochastic processes we obtain hidden Markov models (HMM). There are three ways to model the observation probabilities in a state, and thus we distinguish three types of HMM models:

Discrete HMMs assume that the observation vectors (speech features) are drawn from a finite discrete set. This can be achieved by applying vector quantization (VQ) to the observation vectors. Each observation vector is then represented by a single vector from a VQ codebook. This codebook is constructed over a training set so that the training vectors, when replaced by the limited set of vectors, suffer minimal (acoustic) distortion; e.g. if the codebook has L vectors, then any speech is represented as a vector sequence consisting of only L different vectors. Therefore each state must hold the probabilities of these L vectors being observed in the state, i.e. L probabilities.

This is a simple and fast method, and in the presence of noise VQ may provide a certain kind of noise removal, so the recognition process may be more robust to a certain degree.

However, the VQ process introduces permanent errors that may be significant, especially in the case of small codebooks and small training databases.
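The vector quantization step can be sketched with scipy's k-means based VQ routines; the codebook size and the random placeholder arrays below stand in for real training and test features.

```python
# Discrete HMM front end (sketch): train a VQ codebook, then replace every
# observation vector by the index of its nearest codebook entry.
import numpy as np
from scipy.cluster.vq import kmeans, vq

train_features = np.random.randn(5000, 13)    # placeholder for real training feature vectors
codebook, _ = kmeans(train_features, 256)     # L = 256 codebook vectors (assumed size)

utterance = np.random.randn(120, 13)          # placeholder for one test utterance
codes, _ = vq(utterance, codebook)            # utterance -> sequence of discrete VQ indices
# each HMM state then stores 256 discrete observation probabilities
```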

Continuous HMMs are the standard technique in current speech recognition systems and provide good results for a variety of applications.

The probability of an observation in a state, P(x|Si), is modeled by a weighted mixture of P Gaussian distributions given by:

P(x \mid S_i) = \sum_{p=1}^{P} c_{ip}\, \mathcal{N}(x;\, \mu_{ip},\, \Sigma_{ip}) \quad (091)

Such a probability is then given by the weighting coefficients (ci), mean vectors (μ) and covariance matrices (Σ).

An example of its modeling capability in a 2D feature space is shown in the following figure.

Two Gaussian mixtures in a 2D space.

Such a model may properly describe the observation space even with only a limited number of training examples.

Its drawbacks are increased computational complexity and memory requirements.
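The state observation likelihood of formula (091) with diagonal covariances can be evaluated as in the sketch below; the parameter shapes follow the convention stated in the docstring and are assumptions for illustration.

```python
# b_i(x) = sum_p c_p * N(x; mu_p, Sigma_p) with diagonal covariance matrices.
import numpy as np

def gmm_likelihood(x, weights, means, variances):
    """x: (D,), weights: (P,), means and variances: (P, D) diagonal Gaussians."""
    diff = x - means
    exponents = -0.5 * np.sum(diff * diff / variances, axis=1)
    norms = np.sqrt((2 * np.pi) ** x.size * np.prod(variances, axis=1))
    return float(np.sum(weights * np.exp(exponents) / norms))
```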

Semi-continuous HMMs combine the advantages of both discrete and continuous HMMs. They use a global description of the feature space given by a large number of Gaussian distributions instead of discrete codebook vectors. These mixtures are shared by all states, which reduces the number of required parameters and simplifies the training and recognition processes. A state-specific description is then built on top of the global space description by weighting the outputs of the global Gaussian mixtures, i.e. each state uses discrete probabilities of generating feature vectors from a particular global mixture given the state Si.

An example of a 4 state left-right discrete HMM model is shown in the following picture.

A 4 state left-right discrete HMM model.

The probability of observing a sequence of feature vectors (observations) of length T given a model λ with N states is then calculated using an auxiliary variable α as follows:

\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, p_{ij}\right] b_j(x_{t+1}) \quad (092), where \alpha_1(i) = \pi_i\, b_i(x_1) \quad (093) and b_j(x_t) = P(x_t \mid S_j) \quad (094)

The final probability is given by:

P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \quad (095)
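A direct transcription of the forward recursion (092)-(095) is sketched below; B[t, i] stands for b_i(x_t), so the observation likelihoods must be computed beforehand, and in practice scaling or log arithmetic would be added to avoid numerical underflow.

```python
import numpy as np

def forward_probability(pi, A, B):
    """pi: (N,) initial probs, A: (N, N) transitions, B: (T, N) observation likelihoods."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                        # (093): alpha_1(i) = pi_i * b_i(x_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]    # (092): recursion over time
    return alpha[-1].sum()                      # (095): P(X | lambda)
```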

The recognition process then calculates the probability of the unknown sequence for all HMM models in the dictionary and selects the one with the highest probability. The process is schematically depicted in the following picture.

The speech recognition process based on HMM.

A great advantage of HMMs is the simple way in which different models can be concatenated in a row. By doing so, we can express any sentence consisting of vocabulary words.

In practical real-time applications only the most probable path (sequence of states) is sought and calculated, i.e. P(x1, St1, x2, St2, …, xT, StT). Having found this sequence of states, it is possible to track back the models that produced the words and further deduce whole sentences. This is also important from the computational time point of view. To discover the most probable string of states (they are hidden) that produced the unknown observations, the Viterbi algorithm is used.
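A log-space Viterbi sketch matching this description is given below: for every state only the best predecessor is kept and the hidden state sequence is recovered by backtracking; zero transition probabilities simply become -inf in the log domain.

```python
import numpy as np

def viterbi(pi, A, B):
    """pi: (N,), A: (N, N), B: (T, N) observation likelihoods; returns (best path, log prob)."""
    T, N = B.shape
    with np.errstate(divide="ignore"):                 # log(0) -> -inf for forbidden transitions
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A         # scores[i, j]: best path ending in i, then i -> j
        back[t] = np.argmax(scores, axis=0)            # remember the best predecessor of each state
        delta[t] = scores[back[t], np.arange(N)] + log_B[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                      # backtrack the hidden state sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))
```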

The most difficult task related to HMM modeling is the training process, i.e. how to set all the unknown parameters (transition matrices, initial probability vectors, Gaussian mixture weights, mean vectors and covariance matrices) given several realizations of recorded speech samples. There are a few techniques that can be used, but it was found that the problem has no analytical solution. Therefore iterative methods are used, which lead only to local extremes. Hence the training process is very complex and undergoes several stages: threshold settings, initialization of models, training of simple models, and a gradual increase of their complexity interlaced with training loops. Several methods exist for training HMM models, but the most famous and widely used is the Baum-Welch algorithm, which is a maximum likelihood approach. Another question is the structure of the models; usually only left-right transitions are allowed for speech events, and each stable unit like a phoneme is modeled by three states (beginning, middle and end stage). Usually each state has 16 to 64 Gaussian mixtures with diagonal covariance matrices, but this number varies with the complexity of the task and the amount of available data. As with any modeling method where the optimal structure of the model is unknown and only a limited set of data is available, the so-called overtraining phenomenon may take place. Thus it is necessary to use a validation set to verify the performance of the final models.

At present the most advanced HMM systems use discriminative training like MMI or MCE, and instead of Gaussian mixtures they use discriminative techniques like support vector machines or neural networks (deep belief networks).