First, it should be noted that no single feature completely fulfills all the requirements mentioned above. Research in this area therefore remains active and many acoustic speech features have been designed, but the most commonly used are Mel frequency cepstral coefficients (MFCC) and Perceptual Linear Prediction (PLP). Both PLP and MFCC try to simulate the human auditory system, which results in good performance in speech recognition tasks, and both are able to capture the positions and widths of the most perceivable formants. Despite obvious similarities, they differ in the psychoacoustic phenomena they encompass.
The MFCC computation applies a high-pass filter (suppressing the effect of lip radiation), segments the speech by a Hamming window, and converts each segment to a spectrum by the DFT. Next, the spectrum is non-linearly warped onto the Mel scale (a psychoacoustic scale that reflects human perception), over which equally spaced triangular windows with 50% overlap are placed to simulate a filter bank (see Fig. 3.5). In the final stage, the logarithm and the Discrete Cosine Transform (DCT) are applied; moreover, the DCT suppresses dependencies among the features.
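The following is a minimal sketch of this pipeline. The sampling rate (16 kHz), frame length and hop (25 ms / 10 ms), number of filters (26), number of retained coefficients (13), and the pre-emphasis constant (0.97) are common but illustrative choices, not values prescribed by the text.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, preemph=0.97):
    # High-pass (pre-emphasis) filter compensating for lip radiation.
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    # Segment into overlapping frames and apply the Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum via the DFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular filters equally spaced on the Mel scale with 50% overlap.
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Logarithm and DCT; the DCT also decorrelates the coefficients.
    log_e = np.log(np.maximum(power @ fbank.T, 1e-10))
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]
```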
PLP features differ from MFCC in several aspects: the use of the Bark frequency scale, smoothing and sampling of the Bark-scaled spectrum in 1-Bark intervals, equal-loudness weighting, transformation of energies into loudness, calculation of a linear speech production model, and its transformation into a cepstrum.
Thus PLP applies more complex psychoacoustic processing than MFCC; however, both usually produce similar results in speech recognition tasks under laboratory conditions.
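As a concrete illustration of the PLP steps listed above, the sketch below processes a single power spectrum: Bark warping, integration in 1-Bark bands (a rectangular simplification of the asymmetric critical-band curves of the full method), Hermansky's equal-loudness approximation, the cube-root intensity-to-loudness law, an all-pole model fitted by the Levinson-Durbin recursion, and the conversion to cepstral coefficients. Function names, the model order, and the numeric constants are illustrative assumptions, not the author's exact implementation.

```python
import numpy as np

def plp_auditory_spectrum(power, fs, n_fft):
    # Warp the linear frequency axis onto the Bark scale.
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    bark = 6.0 * np.arcsinh(freqs / 600.0)

    # Smooth and sample the Bark-scaled spectrum at 1-Bark intervals;
    # rectangular 1-Bark integration stands in for the asymmetric
    # critical-band masking curves.
    centers = np.arange(1.0, bark[-1] - 0.5, 1.0)
    bands = np.array([power[(bark >= c - 0.5) & (bark < c + 0.5)].sum()
                      for c in centers])

    # Equal-loudness weighting (Hermansky's 40 dB approximation).
    w2 = (2.0 * np.pi * 600.0 * np.sinh(centers / 6.0)) ** 2
    eql = (w2 + 56.8e6) * w2**2 / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

    # Intensity-to-loudness conversion (cube-root power law).
    return (bands * eql) ** 0.33

def lpc_cepstrum(auditory, order=12, n_ceps=13):
    # Autocorrelation as the inverse DFT of the symmetrized spectrum.
    spec = np.concatenate([auditory, auditory[-2:0:-1]])
    r = np.fft.ifft(spec).real[:order + 1]

    # Levinson-Durbin recursion fits the all-pole (linear prediction) model.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k

    # Recursively convert the LPC polynomial into cepstral coefficients.
    c = np.zeros(n_ceps)
    c[0] = np.log(max(err, 1e-10))   # c0 carries the model gain
    for n in range(1, n_ceps):
        c[n] = -(a[n] if n <= order else 0.0) - sum(
            (k / n) * c[k] * a[n - k] for k in range(1, n) if n - k <= order)
    return c
```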
Speech is essentially a particular sequence of different sounds, so it makes sense to measure and evaluate the transitions between them. The most common way to do so is via delta and acceleration coefficients constructed over the acoustic features in time. They can be computed as differences between two adjacent frames or, more generally, as a linear combination of differences covering a wider time span. Furthermore, it has been shown that the energy envelope can locate high-energy vowels and low-energy unvoiced consonants, which augments the overall discriminative information. Therefore a (normalized) energy feature is often appended to the acoustic features as well.
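A minimal sketch of the wider-span variant follows, using the common regression formula over a window of +/- N frames; the window size N = 2 is a typical but arbitrary choice, and the feature matrix is assumed to be arranged as frames by coefficients.

```python
import numpy as np

def deltas(features, N=2):
    # Delta coefficients as a linear combination of frame differences:
    # d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2).
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(features)
    return sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
               for n in range(1, N + 1)) / denom
```

Acceleration coefficients are simply the deltas of the deltas, e.g. `dd = deltas(deltas(feats))`, and the normalized energy column is typically stacked onto the static features before the delta computation.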