Genuine speech signals are created and produced by human beings, more precisely by their brains and vocal apparatus, both of which are unique to every individual. Both phases, creation and production, naturally leave their marks on the audible signal, and thus speech can be regarded as a biometric signal.
The main purpose of a speech signal is to convey lexical information, but that is not all it contains. Besides the lexical part, which is roughly given by the sequence of positions of the vocal organs, it carries biometric information about the speaker. This information is reflected mainly in the shapes, sizes, weights and toughness of the vocal organs, the current mood of the person (intonation, speech pace, stress, etc.), and the person's social background (dialect, vocabulary, etc.).
However, these different pieces of information are encoded in the speech signal by a complex transformation that is unknown and believed to be irreversible. It is therefore difficult to extract just the information needed for a particular task (lexical content, identity, mood, health state, …). Furthermore, speech exhibits great within-speaker variability caused by the individual’s current mood, health, physical state or other conditions. Finally, the acoustic form of a speech signal can be seriously altered by differences in recording devices, the room in which it was recorded, and the background noise present.
Modifications of speech that are not related to the speaker (devices, room, etc.) are called session variability. Session variability causes major problems and must be dealt with in situations where the enrolment conditions do not match the actual deployment conditions.