Once proper speech features have been extracted, the speech is represented as a sequence of feature vectors, e.g. MFCC. The recognition process then, roughly speaking, takes samples or models of known speech units (phonemes, words, etc., obtained from a training database) and compares them with the unknown speech sample, i.e. with its feature vectors. The sample or model with the highest match (score) is claimed to be the recognized word. Speech signals, however, have special properties: two utterances of the same word practically always differ in length (somebody speaks faster or slower, etc.). Moreover, this length variability is not uniformly distributed along the time axis, so some parts may last longer while others may be uttered faster. Therefore the basic approach of compensating for length differences by linear interpolation or decimation cannot be successfully applied here. Furthermore, depending on the models the system uses, it is usually necessary to concatenate a sequence of samples or models to represent a certain word or even a whole sentence. These two phenomena specific to speech (non-uniform variability in length and concatenation of models) gave rise to the development of specific classification approaches. Currently the most common are the Dynamic Time Warping (DTW) and Hidden Markov Model (HMM) methods, although there exist further modifications and even combinations of them that may be eligible in particular applications. In the following, a brief introduction to these two methods is provided.
DTW stands for dynamic time warping, a method that acoustically compares two sequences of speech feature vectors (a reference utterance and a test utterance). It is based on nonlinear time warping during the comparison process, so that the two sequences are aligned as closely as possible (as evaluated by a proper acoustic measure). Thus it compensates for nonlinear variations in length within words.
To do so, the first and last vectors of the two sequences must be aligned. Therefore this approach requires prior knowledge of the word boundaries, which may be a tricky task in itself when it has to be done automatically. However, there exist modifications of DTW that relax this strict requirement.
Briefly, the method tries to find a mapping between two sequences of vectors of different lengths so that each vector has a partner vector from the other sequence to be compared with. This means that some vectors may be omitted at a particular time, or one vector can be mapped to several vectors from the other sequence. Of course, this mapping cannot be chosen arbitrarily; it must follow certain logical limitations: the beginning and end vectors of one sequence must be mapped to their counterparts in the second sequence, the warping functions must be non-decreasing, and there is a maximal allowed discrepancy that the nonlinear mapping can compensate for (usually vectors whose indexes differ by more than a factor of two cannot be compared, etc.). Two matrices are used in the DTW calculation process: a local distance matrix and a global distance matrix. The local matrix stores the acoustic distances between the reference and unknown feature vectors. The global matrix is used to find the path (mapping) and to accumulate the minimal distance along the optimal path. Thus, to every element of the global matrix there is related a minimal distance and an optimal path connecting its position to the beginning point, which lies in the lower left corner. This situation is depicted in Fig. 3.6. Of course, there are natural limitations on the directions in which the path may move from one point to the subsequent ones (non-decreasing in both the horizontal and vertical directions). Once the search process reaches the end point of the global matrix (the upper right corner), the comparison is over and the distance is found. This process is repeated for every word from the dictionary, and the word with the smallest global distance is claimed to be the unknown one. As can be seen, this method is eligible for the recognition of isolated words or commands.
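For illustration, a minimal sketch of this accumulation is given below, assuming Euclidean local distances and the simple horizontal/vertical/diagonal step pattern; the function name and these particular choices are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def dtw_distance(ref, test):
    """Accumulate the minimal global distance between two feature-vector
    sequences (rows are time frames, columns are e.g. MFCC coefficients)."""
    n, m = len(ref), len(test)
    # Local distance matrix: Euclidean distance between every frame pair.
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    # Global distance matrix, initialised to infinity except the start point
    # (lower left corner of Fig. 3.6).
    global_d = np.full((n, m), np.inf)
    global_d[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Allowed predecessors: horizontal, vertical and diagonal steps,
            # i.e. the path is non-decreasing in both directions.
            best_prev = min(
                global_d[i - 1, j] if i > 0 else np.inf,
                global_d[i, j - 1] if j > 0 else np.inf,
                global_d[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            global_d[i, j] = local[i, j] + best_prev
    # The accumulated distance at the upper right corner is the DTW distance.
    return global_d[-1, -1]
```

The unknown utterance would then be compared against every reference template in the dictionary, and the word with the smallest returned distance would be selected.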
DTW held a significant position in speech recognition, especially for speaker-dependent isolated word recognition. However, as the requirements grew, e.g. towards speaker independence and continuous speech recognition, it gradually lost its position to the HMM method.
The Hidden Markov Model is a statistical modeling technique, widely used in speech recognition, that solves both speaker independence and the concatenation of basic models (to form words, phrases and sentences, and even to cover continuous speech) in a mathematically elegant way.
For each selected speech unit (phoneme, syllable, word, …) an HMM is created with a certain structure. Usually all models share the same structure and differ only in the free parameters of the model. In the training process, just those free parameters are estimated using the training database. The training database consists of speech utterances that are labeled so that it is known exactly what was said. The parameters of the HMM models are mostly adjusted in such a way that the models describe the training data with the highest probability, the so-called maximum likelihood criterion. However, systems that use different strategies, based on maximal separation between models or on minimizing error rates (discriminative criteria), may provide more accurate results.
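As a concrete illustration of maximum-likelihood training, the sketch below fits one Gaussian HMM per word using the third-party hmmlearn library; the library choice, the number of states and the variable names (e.g. training_data) are assumptions made for the example, not something prescribed by the text.

```python
import numpy as np
from hmmlearn import hmm

def train_word_models(training_data, n_states=5):
    """Fit one HMM per word by maximum likelihood (Baum-Welch / EM).

    training_data: dict mapping a word label to a list of MFCC arrays,
                   each of shape (n_frames, n_coefficients).
    """
    word_models = {}
    for word, utterances in training_data.items():
        # hmmlearn expects all sequences stacked, plus their lengths.
        X = np.concatenate(utterances)
        lengths = [len(u) for u in utterances]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag",
                                n_iter=20)
        model.fit(X, lengths)   # adjusts the free parameters to the data
        word_models[word] = model
    return word_models
```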
Each model consists of several states that are connected to each other. With each connection a transition probability (p) is associated. Furthermore, there is an initial probability (π) for each state, which is the probability that the model starts in the given state.
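For illustration, the parameters of a simple 3-state left-right model might be written down as follows; the particular numbers are arbitrary and serve only to show the structure.

```python
import numpy as np

# Initial probabilities pi: a left-right model always starts in the first state.
pi = np.array([1.0, 0.0, 0.0])

# Transition probabilities p(i -> j): each row sums to one; the zeros below the
# diagonal forbid jumping back, which gives the left-right topology.
p = np.array([
    [0.6, 0.4, 0.0],   # stay in state 1 or move on to state 2
    [0.0, 0.7, 0.3],   # stay in state 2 or move on to state 3
    [0.0, 0.0, 1.0],   # final state
])
```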
Having such a model, the probability of a state sequence S1, S2, …, SN is given by:

P(S1, S2, …, SN) = π(S1) · p(S1, S2) · p(S2, S3) · … · p(SN-1, SN)
Moreover, there is an additional probability associated with each state, namely the probability of observing a feature vector X in a given state S, i.e. P(X/S). Then the probability of observing a sequence of feature vectors X1, …, XN together with a state sequence S1, S2, …, SN (one feature vector being emitted in each visited state) is as follows:

P(X1, …, XN, S1, …, SN) = π(S1) · P(X1/S1) · p(S1, S2) · P(X2/S2) · … · p(SN-1, SN) · P(XN/SN)
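The two formulas above can be evaluated directly. A small sketch follows, reusing the illustrative pi and p defined earlier and assuming a hypothetical function emission_prob(x, s) that returns P(X/S); none of these names come from the text.

```python
def state_sequence_prob(states, pi, p):
    """P(S1, ..., SN) = pi(S1) * p(S1,S2) * ... * p(SN-1,SN)."""
    prob = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        prob *= p[prev, cur]
    return prob

def joint_prob(observations, states, pi, p, emission_prob):
    """P(X1..XN, S1..SN): additionally multiply in P(Xi/Si) for each visited state."""
    prob = pi[states[0]] * emission_prob(observations[0], states[0])
    for i in range(1, len(states)):
        prob *= p[states[i - 1], states[i]] * emission_prob(observations[i], states[i])
    return prob
```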
An example of a 4-state left-right HMM model is shown in Fig. 3.7.
In the recognition process, the probability of the unknown sequence is then calculated for all HMM models in the dictionary, and the model with the highest probability is chosen. The process is schematically depicted in Fig. 3.8.
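Continuing the hypothetical hmmlearn-based sketch above, recognition then amounts to scoring the unknown feature sequence against every trained word model and selecting the best one; word_models is the dictionary returned by the training sketch.

```python
def recognize(features, word_models):
    """Return the word whose HMM assigns the unknown feature sequence
    the highest (log-)likelihood."""
    best_word, best_score = None, float("-inf")
    for word, model in word_models.items():
        score = model.score(features)   # log P(X | model)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```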
At present, the most advanced HMM systems, achieving the lowest word error rates (WER), use strategies like Maximum Mutual Information (MMI), Minimum Classification Error (MCE), etc. Both MMI and MCE belong to discriminative training. Finally, there are very successful hybrid combinations with other classification methods, like Support Vector Machines or Neural Networks, providing even lower WER on unseen data.
A more detailed description of speech recognition technology and the methods used can be found, e.g., in [7].