When compressing a generic audio signal, there are numerous coding standards and compression approaches to choose from. Many of them focus on specific types of audio (e.g. speech) or are optimized for particular parameters (computational complexity, delay, etc.).
Sampling frequency describes how many signal samples are obtained every second. Generally, the higher the sampling frequency, the higher the precision and quality of the recording that can be obtained. The most commonly used sampling frequencies are 8 kHz, 16 kHz, 22.05 kHz, 32 kHz, 44.1 kHz and 48 kHz for each audio channel.
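As a simple illustration of what these parameters mean in practice, the sampling frequency, bit depth and channel count together determine the size of the uncompressed PCM signal (the helper below is a generic sketch, not part of any standard):

```python
def pcm_size_bytes(duration_s: float, fs_hz: int, bits_per_sample: int, channels: int) -> int:
    """Size of an uncompressed PCM recording in bytes."""
    samples_per_channel = int(duration_s * fs_hz)
    return samples_per_channel * channels * bits_per_sample // 8

# One minute of CD-quality stereo audio (44.1 kHz, 16 bits, 2 channels):
print(pcm_size_bytes(60, 44_100, 16, 2))  # 10_584_000 bytes, roughly 10 MB
```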
Auditory masking is a phenomenon arising from imperfections in the human auditory system. The ear cannot hear all sounds, a limitation described by the absolute threshold of hearing. Additionally, loud sounds often cover quieter sounds which occur close to them. This can happen in both the time and the frequency domain, dividing masking into simultaneous (frequency) masking and temporal (time) masking.
The loud sound is called the masker. When two sounds occur at the same time, simultaneous masking may occur. The masker creates a masking threshold below which no other sound can be heard. If a signal is close to the masker in frequency and falls below the threshold, it will be masked. The following image shows how the masker can hide a quieter signal in the frequency domain. The combination of the maskers’ masking thresholds and the absolute threshold of hearing leads to the global masking threshold, which may change over time.
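A minimal sketch of how such a combination might be computed, assuming the thresholds are given per frequency bin in dB SPL and are combined by adding intensities (power addition), a common choice in psychoacoustic models; the function name and interface are illustrative only:

```python
import numpy as np

def global_masking_threshold(absolute_threshold_db: np.ndarray,
                             masker_thresholds_db: list[np.ndarray]) -> np.ndarray:
    """Combine the absolute threshold of hearing with the masking thresholds
    of the individual maskers by adding their intensities (power addition).
    All inputs are per-frequency-bin levels in dB SPL."""
    intensities = 10.0 ** (absolute_threshold_db / 10.0)
    for threshold_db in masker_thresholds_db:
        intensities += 10.0 ** (threshold_db / 10.0)
    return 10.0 * np.log10(intensities)
```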
In temporal, or non-simultaneous, masking, the masker can mask a signal which occurs shortly before (premasking) or after (postmasking) the masker itself. Again, the intensity of the masker must be higher than the signal’s intensity.
Frequency masking has been explored in considerable detail and is widely used in many audio codecs (as will be shown later on). Temporal masking, on the other hand, has not been examined with the same precision. This is mainly due to the duration of pre- and postmasking: postmasking takes place up to 300 ms after the end of the masker, while premasking lasts only some 50 ms or less. These intervals are too short to be exploited precisely, because codecs usually work with frames of 20 ms or more, leaving only 2 or 3 frames for premasking.
Currently, the most widely used audio codecs are based on the work of the Moving Picture Experts Group (MPEG), which operates under the International Organization for Standardization (ISO). During its existence the group has introduced several audio formats which have gained worldwide adoption.
As will become obvious later on, these codecs are based on lossy coding, which means they modify the original audio signal and the output is never identical to the original.
The MPEG-1 standard represents a flexible coding technique, employing several methods such as subband coding, filter bank analysis, transform coding, entropy coding and psychoacoustic analysis. It works with sampling frequencies of 32, 44.1 or 48 kHz at 16 bits/sample, and the output bitrate varies from 32 up to 192 kbit/s per channel. The standard offers four channel modes, namely mono, stereo, dual mono and joint stereo (layer III only).
The standard’s architecture contains three layers which differ in computational complexity, delay and output quality. Layers I (mp1) and II (mp2) are similar and differ only in a few details. Both layers employ the fast Fourier transform (FFT), but the window size is 512 samples in layer I and 1024 samples in layer II. The maximum subband quantization resolution is 15 bits/sample in layer I and 16 bits/sample in layer II. Even though these differences seem minimal, it has been shown that layer II at 128 kbit/s provides the same or even higher output quality than layer I at 192 kbit/s per audio channel.
The compression in both layers I and II works with a PCM input signal which is divided into 32 subbands. In parallel with this division, an FFT is computed in order to perform the psychoacoustic analysis and determine the just noticeable distortion (JND). Based on the masking threshold of each subband, suitable quantization steps are chosen so that the required bitrate is met and the quantization noise stays below the masking level. The quantized subband samples are then packed into the output bitstream together with the bit allocation and scale factor information.
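As an illustration of the bit-allocation step (a simplified sketch, not the exact MPEG-1 procedure), bits can be handed out greedily to the subband whose quantization noise currently exceeds its masking threshold by the largest margin, using the rough rule that each additional bit lowers the noise by about 6 dB:

```python
import numpy as np

def allocate_bits(signal_db: np.ndarray, mask_db: np.ndarray,
                  bit_budget: int, samples_per_subband: int = 12,
                  max_bits: int = 15) -> np.ndarray:
    """Greedy bit allocation over the subbands (32 in MPEG-1 layers I/II).

    Each extra bit per sample is assumed to lower the quantization noise by
    roughly 6 dB, so bits go to the subband where the noise exceeds the
    masking threshold by the largest margin."""
    bits = np.zeros(len(signal_db), dtype=int)
    while bit_budget >= samples_per_subband:
        # Noise-to-mask ratio with the current allocation (6 dB per bit rule).
        nmr = (signal_db - 6.0 * bits) - mask_db
        nmr[bits >= max_bits] = -np.inf  # saturated subbands get no more bits
        band = int(np.argmax(nmr))
        if nmr[band] <= 0:
            break  # every subband's noise is already below its mask
        bits[band] += 1
        bit_budget -= samples_per_subband
    return bits
```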
Although MPEG-1 layer II provides acceptable results, the still-dominant format is MPEG-1 layer III, commonly known by its abbreviation mp3. It is based on layers I and II, but adds a few techniques which allow a lower bitrate (64 kbit/s per channel) while maintaining the same quality as its predecessors.
The algorithm takes 1152 samples and divides them into 2 granules (576 samples each). Each granule passes through the hybrid filter bank (a set of band-pass filters used to split the input into subbands, each of which may then be processed individually) in order to improve frequency resolution. Each subband is then transformed into the frequency domain using the modified discrete cosine transform (MDCT). Bit allocation and quantization are then performed iteratively, with an analysis-by-synthesis process carried out in each iteration to determine the quantization noise level.
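A hedged, simplified sketch of the analysis-by-synthesis idea (the actual layer III encoder uses nested rate and distortion loops with a non-uniform quantizer, which are omitted here): the spectrum is quantized, immediately dequantized, and the resulting noise power is compared against the masking threshold, refining the step size until the noise is inaudible or the iteration limit is reached:

```python
import numpy as np

def quantize_with_noise_check(spectrum: np.ndarray, mask_power: np.ndarray,
                              step: float = 1.0, max_iterations: int = 32) -> np.ndarray:
    """Simplified analysis-by-synthesis loop: shrink the quantization step
    until the synthesized (dequantized) spectrum's noise power falls below
    the masking threshold in every coefficient, or the iteration limit hits."""
    for _ in range(max_iterations):
        quantized = np.round(spectrum / step)
        synthesized = quantized * step               # "synthesis" part
        noise_power = (spectrum - synthesized) ** 2  # "analysis" part
        if np.all(noise_power <= mask_power):
            break
        step *= 0.8  # finer quantization, at the cost of more bits
    return quantized
```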
The modified discrete cosine transform (MDCT) is derived from the discrete Fourier transform (DFT) but is specifically designed for signals processed in overlapping blocks of samples. It decomposes, or transforms, the input signal into a set of cosine functions. Whereas the Fourier transform outputs a set of complex numbers, the MDCT output is a set of real numbers representing the cosine functions. Moreover, the DFT outputs the same number of coefficients as there are input samples, while the MDCT, due to its overlap feature, outputs half as many coefficients as there are input samples.
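A direct, unoptimized implementation of the textbook MDCT definition illustrates the 2N-samples-in, N-coefficients-out behaviour described above (real codecs apply a window and use fast algorithms, both omitted here):

```python
import numpy as np

def mdct(block: np.ndarray) -> np.ndarray:
    """Direct MDCT of a block of 2N samples, producing N real coefficients,
    following the definition
    X_k = sum_n x_n * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))."""
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * np.outer(k + 0.5, n + 0.5 + n_half / 2))
    return basis @ block

# 2N = 8 input samples yield N = 4 coefficients:
print(mdct(np.arange(8.0)).shape)  # (4,)
```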
There are two extensions to the original layer III format, namely MP3pro and mp3 surround. MP3pro adds a technique called spectral band replication (SBR): at lower bitrates the higher frequencies are removed from the original signal and later reconstructed from the compressed signal using side information. mp3 surround allows 5.1-channel audio (five full-range channels and one low-frequency, i.e. bass, channel) to be compressed into mp3’s 2 channels, from which the 5.1 channels can be reconstructed using side information added to the file. The side information is ignored by a non-supporting decoder, and the file is then played back as an ordinary mp3 file.
MPEG-2 is the formal successor of MPEG-1. It comprises two modes: one is backward compatible with MPEG-1 (MPEG-2 BC), while the other, MPEG-2 Non-Backward Compatible (MPEG-2 NBC), abandons backward compatibility in favour of new methods and coding techniques.
MPEG-2 BC only adds support for lower sampling frequencies (LSF) and multi-channel coding and is quite similar to mp3 surround. MPEG-2 NBC is also known as Advanced Audio Coding (AAC) and is designed as a set of tools for effective coding: the more tools are used, the better the compression achieved at the same quality, at the cost of higher complexity and delay. Unlike MPEG-1, it no longer uses a hybrid filter bank to analyse the signal; only the MDCT is used, and the transform employs different window functions. MPEG-2 AAC later became part of the MPEG-4 family of standards.
MPEG-4 AAC attempts to challenge the dominance of the mp3 format. It supports sampling frequencies from 8 up to 96 kHz, 1 to 48 audio channels plus 15 low-frequency (bass) channels and 15 additional data channels, with 8, 16, 24 or 32 bits per sample. The Low Complexity (LC) AAC profile represents the original MPEG-2 AAC codec and is suitable for speech coding at 8-12 kbit/s. The High Efficiency (HE) AAC profile adds the SBR technology (v1) and parametric stereo (v2), which is based on the joint stereo profile of MPEG-1 layer III.
The Vorbis audio codec is one of the most successful open-source codecs. Since its standardization in 2000 it has become a direct competitor to MPEG’s mp3. It supports sampling frequencies from 8 up to 192 kHz and up to 255 channels, and its output bitrate is variable by default.
Its coding process differs from MPEG’s: first, the signal is transformed using the MDCT. Then a so-called floor is calculated as a rough approximation of the spectral envelope (a curve which connects the amplitude bins in the frequency spectrum) using a split linear function. The difference between the spectrum and the floor, called the residue, is then coded using multi-pass vector quantization.
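A rough sketch of the floor idea, assuming a dB-scale spectrum, a naive uniform choice of breakpoints (Vorbis’s actual floor setup is more elaborate) and linear interpolation between them; the residue is simply what remains for vector quantization:

```python
import numpy as np

def piecewise_linear_floor(spectrum_db: np.ndarray, num_segments: int = 16) -> np.ndarray:
    """Approximate the spectral envelope with a split linear function:
    the spectrum is divided into segments and the floor is interpolated
    linearly between the spectral values at the segment boundaries."""
    n = len(spectrum_db)
    breakpoints = np.linspace(0, n - 1, num_segments + 1).astype(int)
    return np.interp(np.arange(n), breakpoints, spectrum_db[breakpoints])

def floor_and_residue(spectrum_db: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    floor = piecewise_linear_floor(spectrum_db)
    residue = spectrum_db - floor  # subsequently coded with vector quantization
    return floor, residue
```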
Ogg Vorbis has higher memory requirements than mp3 because its header also contains the entropy code tables (mp3 uses fixed tables) and other settings for the decoder. Nevertheless, it is highly suitable for compression of generic audio signals and provides similar or higher audio quality than the mp3 codec at the same bitrate.
Windows Media Audio (WMA) is a proprietary codec created by Microsoft in response to mp3’s licensing requirements. There are several variants of the codec: WMA 9 is a direct mp3 competitor with support for sampling frequencies up to 48 kHz at 16 bits per sample and output bitrates ranging from 64 to 192 kbit/s, supporting both constant (CBR) and variable (VBR) bitrates.
WMA 10 Professional extends the codec’s capabilities to compete with MPEG-4 AAC by adding a 96 kHz sampling frequency with 24 bits per sample for 7.1-channel audio. If the playback device is not capable of 7.1 output, the signal is automatically downgraded to parameters (sampling frequency, bits per sample and channel downmix) suitable for the device. This suggests that the codec uses a technique similar to that of mp3 surround.
WMA 10 also provides a speech compression mode called WMA 10 Voice, with output bitrates from 4 up to 20 kbit/s. Its speciality, however, is the ability to switch dynamically between the voice codec and the standard codec when the audio signal becomes too complex. Additionally, WMA 10 offers a lossless version which is claimed to reduce the file size of the original PCM signal to a half or even a third of its size.
The WMA 10 Professional codec at 64 kbit/s has been reported to provide higher quality than MPEG-4 AAC v2 in 70% of comparisons.