Because of the variable speech characteristics and the many adverse conditions mentioned in the previous text, many extraction techniques have been devised over time. Basically, a good speech feature must be:
As there are many different speaker-specific features with different physical meanings, we distinguish three sorts of features (from the speaker recognition point of view):
At the acoustic level, short-time features are gathered that are related to the physical characteristics of the vocal apparatus. These methods mainly represent modified spectral (envelope) shapes extracted from intervals ranging from 10 ms to 30 ms. Further, they apply different psychoacoustic principles of the human hearing system to increase their robustness. At present the most common are Mel frequency cepstral coefficients (MFCC), Perceptual Linear Prediction (PLP), and Cepstral Linear Prediction Coefficient (CLPC) features. MFCC and PLP try to capture modified spectral envelopes following psychoacoustic principles such as critical bands, the human perception of frequencies, the equal-loudness curve, the conversion of intensities to loudness, etc. As they extract spectral envelopes, they preserve and emphasise the locations, widths and shapes of the formant frequencies that are vital for the perception of differences among sounds; this makes them very important for speech recognition systems. Even so, they still play a major role in the speaker recognition problem as well. This can be explained by their ability to capture slight differences in the locations and shapes of formant frequencies that vary from person to person within particular phones. CLPC features are based on modelling the speech production mechanism instead of the hearing and perception process. Finally, to capture the dynamics of acoustic features over time, differential and acceleration coefficients may be derived as well; a sketch of such an extraction is given below. As they cover longer time intervals, they may detect differences in co-articulation that are specific to a particular speaker.
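To make the acoustic-level pipeline concrete, the following minimal sketch (assuming NumPy and SciPy are available; the function names, default frame lengths and filterbank sizes are illustrative choices, not a reference implementation) frames the signal into 25 ms windows, applies a Hamming window, computes the power spectrum, warps it with a triangular mel filterbank, takes the logarithm and a DCT to obtain MFCC-like coefficients, and finally derives differential (delta) coefficients by a regression over neighbouring frames.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        # The mel scale approximates the human perception of frequency spacing.
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(signal, fs=16000, frame_len=0.025, frame_step=0.010,
             n_fft=512, n_filters=26, n_ceps=13):
        """Sketch of MFCC extraction: framing, windowing, power spectrum,
        mel filterbank, log compression, DCT. Assumes len(signal) >= one frame."""
        flen = int(round(frame_len * fs))
        fstep = int(round(frame_step * fs))
        n_frames = 1 + (len(signal) - flen) // fstep
        idx = (np.arange(flen)[None, :] +
               fstep * np.arange(n_frames)[:, None])
        frames = signal[idx] * np.hamming(flen)

        # Power spectrum of each short-time (10-30 ms) frame.
        pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

        # Triangular mel filterbank between 0 Hz and fs/2 (critical-band-like warping).
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(
                0.0, 1.0, bins[i] - bins[i - 1], endpoint=False)
            fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(
                1.0, 0.0, bins[i + 1] - bins[i], endpoint=False)

        # Log filterbank energies mimic loudness compression; the DCT
        # decorrelates them into cepstral coefficients.
        fb_energy = np.log(np.maximum(pspec @ fbank.T, 1e-10))
        return dct(fb_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

    def deltas(feat, width=2):
        """Differential (delta) coefficients: a regression over +-width
        neighbouring frames that captures the feature dynamics in time.
        Applying it twice yields acceleration coefficients."""
        padded = np.pad(feat, ((width, width), (0, 0)), mode='edge')
        num = sum(k * (padded[width + k:len(feat) + width + k] -
                       padded[width - k:len(feat) + width - k])
                  for k in range(1, width + 1))
        return num / (2 * sum(k * k for k in range(1, width + 1)))

The exact filterbank layout, the number of cepstral coefficients kept, and the regression width of the delta computation all vary between implementations; the values above are only common defaults.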
The prosodic level focuses more on the style of speaking, the mood of the speaker, specific speaking habits, physical and health conditions, etc. Obviously, this information is located in, and can only be extracted from, longer time intervals spanning several seconds of speech. The most favoured features at this level are rhythm, speech dynamics, pace of speaking, modulation of the fundamental frequency, the sort of pauses made while speaking, etc. However, these features are more difficult to measure and quantify than those at the acoustic level. Thus there are several methods to extract and evaluate them over the proper time interval. The most common approaches are the autocorrelation function, the Average Magnitude Difference Function (AMDF) and inverse filtering for fundamental frequency detection, short-time energy for speech dynamics, and so on; a sketch of the first two follows below. Many modifications of both the autocorrelation function and the AMDF exist.
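As an illustration of these prosodic-level measurements, the sketch below (again assuming NumPy; the function names and the default pitch search range are illustrative assumptions) estimates the fundamental frequency of a single voiced frame with both the autocorrelation function and the AMDF, and computes a short-time energy contour as a simple descriptor of speech dynamics.

    import numpy as np

    def frame_f0_autocorr(frame, fs, f0_min=60.0, f0_max=400.0):
        """F0 estimate for one voiced frame via the autocorrelation function:
        the lag of its strongest peak inside the expected pitch range is taken
        as the pitch period. The frame should span at least two pitch periods."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        lo, hi = int(fs / f0_max), int(fs / f0_min)
        lag = lo + np.argmax(ac[lo:hi])
        return fs / lag

    def frame_f0_amdf(frame, fs, f0_min=60.0, f0_max=400.0):
        """F0 estimate via the Average Magnitude Difference Function: the AMDF
        dips towards zero at lags equal to the pitch period, so the minimum
        inside the search range is selected."""
        frame = frame - np.mean(frame)
        lo, hi = int(fs / f0_max), int(fs / f0_min)
        amdf = np.array([np.mean(np.abs(frame[k:] - frame[:len(frame) - k]))
                         for k in range(lo, hi)])
        return fs / (lo + np.argmin(amdf))

    def short_time_energy(signal, fs, frame_len=0.03, frame_step=0.01):
        """Short-time energy contour, a simple measure of speech dynamics
        that can be tracked over several seconds at the prosodic level."""
        flen, fstep = int(frame_len * fs), int(frame_step * fs)
        starts = range(0, len(signal) - flen + 1, fstep)
        return np.array([np.sum(signal[s:s + flen] ** 2) for s in starts])

In practice such frame-level estimates are combined with voiced/unvoiced decisions and smoothed over time, and the resulting F0 and energy contours are then summarised (e.g. by their range, slopes or statistics) to characterise a speaker's prosody.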