Compression Techniques
Compression of speech

Even though speech is by nature an audio signal, it has specific characteristics that allow more radical compression techniques than generic audio. Firstly, speech is a medium for conveying information. The information does not have to be reproduced in exactly the same form as the original in order to be understood, which means some audio characteristics can be reduced. For example, a standard phone call is sampled at 8 kHz (compared to the standard audio sampling frequency of 44.1 kHz), so only a 4 kHz bandwidth can be captured. This bandwidth, however, contains most of the speech's energy and information.

Secondly, a speech signal is relatively simple compared to, say, a recording of a rock band: there is usually only one speaker at a time and no other instruments. Moreover, to obtain the clearest speech, algorithms are applied to suppress background noise.

Techniques used in speech compression can be divided into the following groups, each described in the sections below:

  1. time-domain (waveform) coding,
  2. frequency-domain coding,
  3. linear predictive coding,
  4. sinusoidal coding.

For the purposes of quality assessment of the various speech processing algorithms, a measure called intelligibility is used. It describes how comprehensible and understandable the speech is, taking into account aspects such as speech level, non-linear distortion, background noise level, echoes, and reverberation. There are two main scales for the measure: the Speech Transmission Index (STI) and the Common Intelligibility Scale (CIS), both ranging from 0 (worst) to 1 (best), or 0% to 100%. Generally, a value of at least 0.5 (50%) is desirable for the speech to be understandable.

Time Domain

Waveform coding in the time domain is represented by the PCM technique. While linear PCM uses equal distances between quantization levels, non-linear PCM uses non-uniform quantization steps; alternatively, the dynamic range of the input signal is compressed (companded) by the transmitter and expanded by the receiver.

There are two companding characteristics defined in the G.711 recommendation: the µ-law (USA and Japan) and the A-law (Europe). For example, the A-law characteristic is given by:

$$
F(x) = \operatorname{sgn}(x)
\begin{cases}
\dfrac{A\,|x|}{1+\ln A}, & 0 \le |x| < \dfrac{1}{A},\\[1ex]
\dfrac{1+\ln(A\,|x|)}{1+\ln A}, & \dfrac{1}{A} \le |x| \le 1,
\end{cases}
\tag{084}
$$

where sgn(x) is ±1 for positive or negative values of x and A is the compression parameter; usually A = 87.6.

An example of the A-law companding curve. Note that higher amplitudes (towards the right of the horizontal axis) are encoded with fewer quantization levels than lower amplitudes.
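As an illustration, here is a minimal NumPy sketch of the A-law curve (084) and its inverse. The function names and the normalization of the input to [-1, 1] are this example's own choices, not part of G.711 itself:

```python
import numpy as np

def a_law_compress(x, A=87.6):
    """Apply the A-law companding curve (084) to samples x normalized to [-1, 1]."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    denom = 1.0 + np.log(A)
    y = np.where(ax < 1.0 / A,
                 A * ax / denom,                                     # linear segment near zero
                 (1.0 + np.log(np.maximum(A * ax, 1e-12))) / denom)  # logarithmic segment
    return np.sign(x) * y

def a_law_expand(y, A=87.6):
    """Invert a_law_compress, recovering the original amplitudes."""
    y = np.asarray(y, dtype=float)
    ay = np.abs(y)
    denom = 1.0 + np.log(A)
    x = np.where(ay < 1.0 / denom,
                 ay * denom / A,                 # inverse of the linear segment
                 np.exp(ay * denom - 1.0) / A)   # inverse of the logarithmic segment
    return np.sign(y) * x
```

Quantizing the companded value uniformly (e.g., to 8 bits, as in telephony) yields an approximately constant relative error across the signal's dynamic range, which is the point of companding.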

Frequency Domain

In the frequency domain, subband coding and adaptive transform coding are used. In Subband Coding (SBC), the speech signal is split into several frequency bands using a set of band-pass filters (a filter bank), and each band signal is decimated to reduce the number of samples. Each subband is then coded, most often using the ADPCM method, which allows flexible quantization and bit assignment.

An example of the filter bank

Alternatively to ADPCM, other methods based on Adaptive Transform Coding (ATC) may be used. In these, the signal is transformed into the frequency domain by the FFT or another transform and split into subbands. Bits are then assigned dynamically to the samples in each subband according to the band's needs.
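The following sketch illustrates the subband-coding idea under simplifying assumptions: equal-width bands, plain FIR filters instead of a proper quadrature-mirror filter bank, and a crude energy-based bit allocation with uniform quantization standing in for ADPCM. All names and parameter values are illustrative:

```python
import numpy as np
from scipy import signal

def subband_encode(x, fs, n_bands=4, total_bits=16):
    """Split x into n_bands equal-width bands, decimate each band by n_bands,
    and allocate quantizer bits per band according to the band's energy."""
    edges = np.linspace(0, fs / 2, n_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:                                            # lowest band: low-pass
            taps = signal.firwin(101, hi, fs=fs)
        elif hi >= fs / 2:                                     # highest band: high-pass
            taps = signal.firwin(101, lo, pass_zero=False, fs=fs)
        else:                                                  # inner bands: band-pass
            taps = signal.firwin(101, [lo, hi], pass_zero=False, fs=fs)
        bands.append(signal.lfilter(taps, 1.0, x)[::n_bands])  # filter, then decimate
    # give more bits to bands with more energy (stand-in for ADPCM's adaptivity)
    energies = np.array([np.var(b) + 1e-12 for b in bands])
    weights = np.log(energies) - np.log(energies).min() + 1.0
    bits = np.maximum(1, np.round(total_bits * weights / weights.sum())).astype(int)
    quantized = [np.round(b / (np.abs(b).max() + 1e-12) * (2 ** (k - 1) - 1))
                 for b, k in zip(bands, bits)]
    return quantized, bits
```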

Linear Predictive Coding

Natural human speech can be understood as the response of the speaker's vocal tract to an excitation signal, in this case the air exhaled from the lungs. The output signal is then shaped by changing the properties of the vocal tract (vocal cords, oral cavity, teeth, etc.). From the signal analysis point of view, the output signal can be represented by the excitation signal and a filter modelling the vocal tract, with time-varying parameters: the filter coefficients are re-calculated for frames of 10 to 30 ms. Although there are various methods to determine the coefficients of the vocal tract function, the most used one is based on linear prediction, hence the name Linear Predictive Coding (LPC).

General scheme of the LPC decoder

The LPC coefficients minimize the squared error between the original and predicted speech samples. As can be seen, the model of the LPC speech generator consists of two parts, as mentioned before:

  1. the excitation generator,
  2. the vocal tract filter.

The vocal tract excitation is represented by a pulse generator and a noise generator, switched depending on whether the piece of speech is voiced or unvoiced. The excitation is further scaled by the gain G to the required level.

The vocal cords vibrate with a specific fundamental frequency f0, which corresponds to the fundamental period T0 = 1/f0. The higher the frequency, the higher the pitch of the speech.
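The fundamental period can be estimated per frame, for instance from the frame's autocorrelation. This is one simple approach among many; the 60-400 Hz search range is an assumption about typical speaking pitch:

```python
import numpy as np

def estimate_pitch(frame, fs, f_lo=60.0, f_hi=400.0):
    """Estimate the fundamental period T0 (in samples) and frequency f0 (in Hz)
    of a voiced frame from the strongest autocorrelation peak in the pitch range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # lags 0..N-1
    lo, hi = int(fs / f_hi), int(fs / f_lo)    # lag range corresponding to 60-400 Hz
    T0 = lo + int(np.argmax(ac[lo:hi]))
    return T0, fs / T0
```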

Depending on the use of the vocal cords, speech sounds can be divided into:

  1. voiced sounds, produced with periodically vibrating vocal cords (e.g., vowels),
  2. unvoiced sounds, produced without vocal cord vibration and noise-like in character (e.g., fricatives such as "s" or "f").

The vocal tract filter is given by a linear digital all-pole filter (an IIR filter, as equation (086) below shows) whose transfer function is:

$$
H(z) = \frac{S(z)}{E(z)} = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}},
\tag{085}
$$

where a_i are the filter coefficients and p is the order of the filter. With S(z) the transform of the output signal and E(z) that of the excitation, the next sample s(n) is obtained as a linear combination of the previous samples plus the scaled excitation G·e(n):

$$
s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\, e(n).
\tag{086}
$$
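The coefficients a_i can be computed per frame by the autocorrelation method with the Levinson-Durbin recursion, which solves the least-squares problem behind (086). A minimal sketch follows; the Hamming window and the prediction order p = 10 are merely typical choices for 8 kHz speech:

```python
import numpy as np

def lpc_coefficients(frame, p=10):
    """Estimate LPC coefficients a_1..a_p and a gain estimate G for one speech
    frame by the autocorrelation method with the Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    # autocorrelation values r[0..p]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0                      # prediction-error filter A(z) = 1 + a_1 z^-1 + ...
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:], np.sqrt(err)     # a_i as used in (086), and a gain estimate G
```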

To be able to use the LPC speech generator, the following information has to be determined for each frame:

  1. the voiced/unvoiced decision,
  2. the fundamental period T0 (for voiced frames),
  3. the gain G,
  4. the p filter coefficients a_i.
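Given these per-frame parameters, a minimal decoder can be sketched as follows: build the excitation (an impulse train for voiced frames, white noise for unvoiced ones), scale it by G, and run it through the all-pole filter of equation (086). The function and variable names are this sketch's own:

```python
import numpy as np

def lpc_decode_frame(a, gain, T0, voiced, n, state):
    """Synthesize n samples of one frame; `state` carries the last p output
    samples of the previous frame so the filter runs continuously."""
    p = len(a)
    if voiced:
        e = np.zeros(n)
        e[::T0] = 1.0               # impulse train with pitch period T0
    else:
        e = np.random.randn(n)      # white-noise excitation for unvoiced frames
    out = np.concatenate([state, np.zeros(n)])
    for i in range(n):              # s(n) = sum_i a_i s(n-i) + G e(n), eq. (086)
        past = out[i:p + i][::-1]   # s(n-1), ..., s(n-p)
        out[p + i] = np.dot(a, past) + gain * e[i]
    return out[p:], out[-p:]        # synthesized frame and the new filter state
```

Starting from state = np.zeros(p) and calling this frame by frame with the transmitted parameters reproduces the classic LPC vocoder sound.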

The bitrate of LPC-encoded speech ranges from 1.2 to 2.4 kbit/s, and its intelligibility is around 80-85%.

However, the reconstructed signal sounds machine-like, which is caused by two main factors:

  1. It is difficult to segment the speech exactly into voiced and unvoiced frames, as natural speech contains more types and combinations of the two.
  2. The fundamental period in natural speech (which characterizes the speaker's voice) changes more often than once per frame, and the change is not periodic.

There are methods which suppress the imperfections of the LPC method by coding the residual between the original signal and the LPC-predicted one.

Residual-Excited Linear Prediction (RELP) computes the difference between the original signal and the LPC-reconstructed one and transmits this residual directly. At the receiver side, the LPC coefficients are used to generate the reconstruction, and the residual is then added to form a more precise result.
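In code, the RELP idea reduces to computing and transmitting the prediction error. A minimal sketch in the notation of (086), ignoring how the residual itself would be compressed:

```python
import numpy as np

def relp_residual(frame, a):
    """Encoder side: predict each sample from the previous p samples and
    return the residual, which is what RELP transmits."""
    p = len(a)
    pred = np.array([np.dot(a, frame[i - p:i][::-1]) for i in range(p, len(frame))])
    return frame[p:] - pred

def relp_reconstruct(residual, a, history):
    """Decoder side: re-run the predictor and add the residual back;
    `history` holds the last p samples of the previous output."""
    p = len(a)
    out = list(history)
    for r in residual:
        out.append(np.dot(a, out[-1:-p - 1:-1]) + r)
    return np.array(out[p:])
```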

A successor to the RELP algorithm is Code-Excited Linear Prediction (CELP). The algorithm is based on the analysis-by-synthesis principle and performs perceptual optimization of the synthesized signal in a closed loop: a fixed codebook is searched for the most suitable excitation function, and only the position in the codebook is transmitted along with the LPC coefficients. Alternatively, the excitation function can be encoded using vector quantization.
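A codebook search in the analysis-by-synthesis spirit can be sketched as follows: every candidate excitation is synthesized through the LPC filter and compared with the target frame, and only the winning index and gain would be transmitted. A real CELP coder additionally applies a perceptual weighting filter before the comparison; the helper and names here are illustrative:

```python
import numpy as np

def _synthesize(e, a, state):
    """Run excitation e through the all-pole LPC filter of (086)."""
    p = len(a)
    out = np.concatenate([state, np.zeros(len(e))])
    for i in range(len(e)):
        out[p + i] = np.dot(a, out[i:p + i][::-1]) + e[i]
    return out[p:]

def celp_search(target, codebook, a, state):
    """Try every codebook vector and keep the (index, gain) pair whose
    synthesized frame is closest to the target in the squared-error sense."""
    best_idx, best_gain, best_err = 0, 0.0, np.inf
    for idx, c in enumerate(codebook):
        synth = _synthesize(c, a, state)
        gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)  # optimal gain
        err = np.sum((target - gain * synth) ** 2)
        if err < best_err:
            best_idx, best_gain, best_err = idx, gain, err
    return best_idx, best_gain      # only these (plus the LPC data) are sent
```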

The method achieves bitrates from 4 to 8 kbit/s. Its disadvantages are relatively high computational demands and a delay of around 35 ms.

A modification of CELP, Low-Delay CELP (LD-CELP), reduces the delay to 2 ms at a bitrate of 16 kbit/s; it is standardized as ITU-T G.728. Another codec based on the CELP technique is Speex, an open-source codec from Xiph.Org, the authors of Ogg Vorbis.

Sinusoidal Coding

Sinusoidal coding derives from the idea that any audio signal is a combination of a deterministic and a stochastic component. The deterministic part can therefore be represented by harmonic functions (sines and cosines), while the stochastic part can be modelled by noise or another parameterization. Here, sinusoids are time-varying frequency trajectories which are perceived together as one tone. The principal scheme of such a coder is given below.

General scheme of the sinusoidal coder
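A minimal sinusoids+noise analysis of a single frame might look as follows: pick the strongest spectral peaks as the deterministic part and keep whatever the re-synthesized sinusoids fail to explain as the stochastic residual. The peak-picking and amplitude estimation here are deliberately crude; real coders track partials across frames:

```python
import numpy as np

def sinusoidal_analysis(frame, fs, max_partials=20):
    """Decompose one frame into sinusoidal parameters plus a noise residual."""
    n = len(frame)
    win = np.hanning(n)
    spec = np.fft.rfft(frame * win)
    mag = np.abs(spec)
    # local maxima of the magnitude spectrum are sinusoid candidates
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    peaks = np.array(sorted(peaks, key=lambda k: mag[k], reverse=True)[:max_partials],
                     dtype=int)
    freqs = peaks * fs / n                      # FFT bin -> frequency in Hz
    amps = 2.0 * mag[peaks] / win.sum()         # rough amplitude of each partial
    phases = np.angle(spec)[peaks]
    t = np.arange(n) / fs
    deterministic = sum(a * np.cos(2 * np.pi * f * t + ph)
                        for f, a, ph in zip(freqs, amps, phases))
    noise = frame - deterministic               # the stochastic part
    return freqs, amps, phases, noise
```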

This model, however, cannot sufficiently capture quickly changing parts of the sound, so a third component has been added: transients, which model fast changes in the signal. This leads to the sinusoids+transients+noise (STN) model.

Another extension of the basic sinusoids+noise (SN) model is the Harmonic + Individual Lines + Noise (HILN) model. In this approach, the sinusoids are split into two groups: a harmonic part and an individual part. In the harmonic part, the sinusoids are represented as integer multiples of a fundamental frequency, and only the multiples are stored. The individual sinusoids are then encoded separately, and the residual is treated as noise.

Sinusoidal coding is expected to deal well with simple signals consisting mainly of harmonic sounds, such as speech. Skype's first codec, SVOPC, was based on this technique; it achieved good quality at 20 kbit/s and was robust against packet loss.

However, its computational demands led to the creation of a new, LPC-based Skype codec named SILK.

On the basis of the SILK codec, combined with properties of the Constrained Energy Lapped Transform (CELT) codec, a new codec was standardized in September 2012: the Opus codec. Opus utilizes SILK's good performance at low frequencies and CELT's low delay at higher frequencies, and can switch between the two on demand. The codec is highly capable of encoding both speech and general audio and is suitable for online applications such as VoIP and live broadcasting.