3 Communication via speech commands

Communication with machines via speech commands falls into a vast research domain called automatic speech recognition (ASR).

This term generally refers to the automatic (machine) transcription of spoken language. The input is a speech signal, usually available in digital form as a sequence of numbers. The output is text in the form of strings of words drawn from a vocabulary (since a general vocabulary can be rather extensive, usually a restricted vocabulary is used for a particular domain). Furthermore, the output string follows either regular grammar rules or frequent patterns observed in spoken language (statistical language modeling). The task is therefore often called the speech-to-text problem.
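To illustrate the statistical language modeling idea on a toy scale (an example constructed here, not taken from the text), a simple bigram model just counts which word pairs occur in a small command corpus and then prefers word strings built from frequent pairs:

from collections import Counter, defaultdict

def train_bigrams(corpus):
    # Count adjacent word pairs; the counts encode the "frequent patterns" of the language.
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ['<s>'] + sentence.split() + ['</s>']
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return counts

def bigram_prob(counts, w1, w2):
    # Relative frequency of w2 following w1 (no smoothing in this toy version).
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

lm = train_bigrams(["turn the light on", "turn the radio on", "switch the light off"])
print(bigram_prob(lm, 'turn', 'the'))   # 1.0 in this toy corpus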

ASR should be distinguished from the task of understanding what was said, which operates on a higher level (its input is already text) and belongs to the field of artificial intelligence.

For several decades there has been a major effort to construct ASR systems that could be widely used, especially in areas such as information retrieval systems, dialog systems, and aids for handicapped people. However, it is only recently that domain-oriented applications have come out of the laboratories. Now that the technology and knowledge have made crucial advances, more sophisticated applications like dictation systems or even automatic transcription of natural speech are emerging. The delay is due to the complexity of the task, which faces many obstacles that must be addressed by different domains of science. Practical systems for the general public should operate in real and mostly very adverse conditions (great variability of noises, recording devices, places of use, etc.), and must cope with great variability of spoken language (the rules are rather loose), speaker variability (in terms of acoustic form), and huge vocabularies, to mention just a few. Furthermore, a typical user requires an immediate system response (real-time operation), is not willing to change his speaking habits or restrict his vocabulary, and quickly loses patience if the system does not work with high accuracy.

The range of possible ASR applications is quite wide, and so is the complexity of the systems required for solving particular tasks (in fact, the complexity grows even faster than the requirements); ASR systems are therefore commonly divided into several classes. The main classification criterion is the vocabulary size, and we distinguish:

However, these numbers change as the technology progresses.

Systems can further be speaker dependent or speaker independent, i.e. they may or may not operate equally well regardless of who is speaking. It is also important whether a system provides an immediate response or works off-line; thus we have real-time systems and off-line systems. Next, it is quite vital to know in which form the system expects the speech samples to be processed. Thus we distinguish:

Finally, we can classify systems based on what speech units they model (phonemes, words, syllables, phrases, ...) and what models they use, e.g. statistical modeling like hidden Markov models (HMM).

A speech signal is produced by the human vocal organs and is observed through air vibrations. Besides much other information, it carries lexical information (what was actually said), represented as a sequence of different acoustic sounds. A set of basic sounds called phonemes is used to build the words of a particular language. Their number varies among languages (usually from 40 to 60). However, their acoustic forms differ from speaker to speaker, and neighboring phonemes influence each other (the co-articulation phenomenon).

All information that is irrelevant or may even hinder correct speech recognition should be removed prior to the recognition process. Basically, we are interested only in the lexical information, so the remaining information, such as speaker identity, mood, current health condition, dialect, speech impairments, and speaking habits, should be suppressed. This is partly the task of the feature extraction method, which is to pick up only the needed information that is to be processed by the following blocks. The lexical information stream amounts to approximately 10 b/s, while the bit rate of the speech signal is about 100 kb/s. Thus the extraction method can be viewed as a bit-rate compressor.
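To put these figures into perspective, the following sketch works out the bit rates under an assumed digitization of 8 kHz sampling with 13-bit samples (roughly the 100 kb/s mentioned above) and a hypothetical feature stream of 100 frames per second with 13 coefficients each; the exact numbers are illustrative only.

# Rough bit-rate comparison (assumed, illustrative figures only).
fs = 8000                 # sampling rate [Hz], assumed
bits_per_sample = 13      # sample width [bits], assumed
signal_rate = fs * bits_per_sample                        # ~104 000 b/s, i.e. about 100 kb/s

frames_per_s = 100        # hypothetical feature frame rate
coeffs = 13               # hypothetical number of coefficients per frame
bits_per_coeff = 8        # hypothetical quantization of one coefficient
feature_rate = frames_per_s * coeffs * bits_per_coeff     # ~10 400 b/s after feature extraction

lexical_rate = 10         # approximate lexical information rate quoted in the text [b/s]
print(signal_rate, feature_rate, lexical_rate)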

The aim is to simulate the human auditory system, describe it mathematically, simplify it for practical handling, and optionally adapt it for correct and simple use with the selected types of recognition and classification methods.

There are several established feature extraction methods that either simulate the speech production process or mimic the human auditory system (critical bands). The latter is motivated by the fact that the auditory system has evolved to focus on relevant information while suppressing ubiquitous noises and distortions.
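As a small illustration of this auditory motivation (an example added here, not taken from the text), the widely used mel scale approximates the ear's non-linear, critical-band-like frequency resolution:

import numpy as np

def hz_to_mel(f_hz):
    # Common mel-scale approximation of the ear's non-linear frequency resolution.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(np.array([100.0, 1000.0, 4000.0])))  # equal steps in Hz shrink in mels at high frequencies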

Over the years of research it was found that, apart from the time domain, the most significant discriminative information for telling phonemes apart lies in the frequency domain. More precisely, it is located in the positions and shapes of the dominant frequency components. To demonstrate this, Fig. 3.1 shows the frequency spectrum of the vowel "e" and its magnitude envelope with the formant frequencies (major spectral peaks) marked. Fig. 3.2 shows the time-domain representation of the same vowel "e".
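A spectral envelope of the kind shown in Fig. 3.1 can be estimated, for instance, by linear prediction; the following sketch (an assumed illustration with hypothetical parameter values, not a method prescribed by the text) computes the short-time spectrum and an LPC envelope whose dominant peaks approximate the formants:

import numpy as np

def lpc_envelope(frame, fs=8000, order=12, nfft=512):
    # Short-time spectrum and LPC spectral envelope of one voiced speech frame.
    frame = frame * np.hamming(len(frame))                        # taper frame edges
    r = np.correlate(frame, frame, 'full')[len(frame)-1:len(frame)+order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])                                 # predictor coefficients
    A = np.concatenate(([1.0], -a))                               # A(z) = 1 - sum_k a_k z^-k
    freqs = np.arange(nfft // 2 + 1) * fs / nfft
    spectrum = 20 * np.log10(np.abs(np.fft.rfft(frame, nfft)) + 1e-10)
    envelope = -20 * np.log10(np.abs(np.fft.rfft(A, nfft)) + 1e-10)   # |1/A(f)| in dB, up to gain
    return freqs, spectrum, envelope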

Figure 3.1. A spectrum, formant frequencies and a spectral envelope for a vowel "e".
Figure 3.2. A signal of a vowel "e".
Figure 3.3. A spectrum of a phoneme "t".
Figure 3.4. A signal of a phoneme "t".

To see the time and frequency differences between various phonemes, Fig. 3.3 depicts the spectrum and Fig. 3.4 the time course of the phoneme "t". To sum up, the following table lists the first two formant frequencies of the common vowels, separately for males and females (average values gathered over a population). These positions provide a very rough but simple way to classify phonemes based on their spectral representations, as the small sketch after the table illustrates.

Table 3.1. First two formant frequencies observed in common vowels for males and females

vowel    Males F1 [Hz]    Males F2 [Hz]    Females F1 [Hz]    Females F2 [Hz]
a        730              1100             850                1200
e        530              1850             600                2350
i        400              2000             430                2500
o        570              850              590                900
u        440              1000             470                1150
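As a toy illustration of this rough classification idea (added here, not part of the original text), a nearest-neighbour rule over the male reference values of Table 3.1 might look as follows; the measured formant values and the distance metric are of course simplifications:

# Male (F1, F2) reference pairs taken from Table 3.1.
FORMANTS = {'a': (730, 1100), 'e': (530, 1850), 'i': (400, 2000),
            'o': (570, 850), 'u': (440, 1000)}

def classify_vowel(f1, f2):
    # Return the vowel whose reference (F1, F2) pair lies closest to the measured one.
    return min(FORMANTS, key=lambda v: (f1 - FORMANTS[v][0]) ** 2 + (f2 - FORMANTS[v][1]) ** 2)

print(classify_vowel(560, 1900))   # -> 'e'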

The extracted features should therefore be able to capture and discriminate these differences in formant frequencies. On the other hand, they should neglect variations that are natural and inaudible. The following table lists the most relevant audible and inaudible spectral modifications.

Table 3.2. Audible and inaudible modifications of speech spectra

Audible modifications                  Inaudible modifications
Number of formant frequencies          Overall tilt of the spectrum
Positions of formant frequencies       Frequencies below the first formant frequency
Widths of formant frequencies          Frequencies above the 3rd formant frequency
-                                      Narrow band-stop filtering

Moreover, the intensity of a signal is perceived non-linearly, which can be approximated by a logarithmic function.

From the noise point of view, features should be insensitive to additive and convolutional noises. Last but not least, a good feature must be easy to implement, mathematically tractable, and have a compact representation. It is usually beneficial if the features are linearly independent of each other, which eases their subsequent processing.
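A rough sketch of how several of these requirements are commonly met in practice is given below (an illustrative, simplified chain rather than a method from the text): band energies compress the representation, the logarithm models the non-linear perception of intensity, and the final DCT produces compact, approximately decorrelated coefficients. The band layout and parameter values are assumptions.

import numpy as np

def cepstral_features(frame, n_bands=20, n_ceps=12, nfft=512):
    # Power spectrum -> band energies -> log compression -> DCT decorrelation.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    bands = np.array_split(spec, n_bands)            # crude equal-width bands (mel-spaced in practice)
    log_e = np.log(np.array([b.sum() for b in bands]) + 1e-10)   # logarithm mimics loudness perception
    n = np.arange(n_bands)
    # DCT-II written out explicitly; it yields compact, roughly decorrelated coefficients
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands)))
                     for k in range(1, n_ceps + 1)])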