Compression Techniques
Compression of video

Video, or moving pictures, is the most common part of multimedia today. The amount of video content grows rapidly and takes up by far the most storage space of all media types. With the growing video creation and playback capabilities of personal devices, the amount of data stored and transmitted rises every day. It is therefore necessary to reduce the size of video in order to lower transmission and storage costs.

A video sequence consists of individual frames, or images, which can be encoded in fundamentally the same way as static images. However, in a sequence of rapidly changing images the same objects typically only change their position over time, so two consecutive frames are very similar and differ only in small details. Thus, if we encode only the changes we can achieve even higher compression.
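
As a minimal illustration of this idea (a Python/numpy sketch with synthetic frames standing in for real video; the sizes and values are arbitrary), encoding only the difference between two consecutive frames produces data that is mostly zero and therefore compresses much better than the frames themselves:

    import numpy as np

    # Two consecutive 8-bit grayscale frames (synthetic stand-ins for real video).
    prev_frame = np.random.randint(0, 200, (288, 352), dtype=np.uint8)
    curr_frame = prev_frame.copy()
    curr_frame[100:120, 150:170] += 5      # only a small region changes between the frames

    # The residual is zero everywhere except in the changed region, so an entropy
    # coder spends almost no bits on the unchanged areas.
    residual = curr_frame.astype(np.int16) - prev_frame.astype(np.int16)
    print("non-zero residual samples:", np.count_nonzero(residual), "of", residual.size)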

Video coding methods can be divided into two main categories:

Waveform coding methods employ transform coding combined with inter-frame prediction and are widely used in a variety of formats. Model-based image coding is used in limited-bandwidth applications, such as video telephony, with bitrates up to 64 kbit/s. These methods exploit the typical video telephony scene, with largely constant content, minimal movement and a reduced frame rate.

Inter-frame prediction and motion compensation

To reduce the temporal redundancy between two frames, inter-frame prediction with motion compensation is used, which works in two stages:

During motion estimation, a motion vector is determined which describes the relative movement of an image block from the previous frame to the current frame. If the block's new position can be predicted in this way, only the motion vector needs to be transmitted. There are two classes of algorithms for motion estimation:

Pel-recursive algorithms attempt to minimize the prediction error iteratively. Since they depend heavily on local statistics they cannot estimate larger displacements and are therefore suitable for slow-motion video, e.g. video telephony. Block matching algorithms assume that all pixels within an image block move in the same direction. The current frame is divided into blocks and, for each of them, the most similar block is searched for in the previous frame.
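
A minimal full-search block matching sketch in Python/numpy follows; the block size, search range and SAD criterion are common illustrative choices, not values mandated by any particular standard:

    import numpy as np

    def match_block(curr, prev, y, x, block=16, search=8):
        """Full-search block matching: find the displacement (dy, dx) that best
        matches the block at (y, x) of the current frame against the previous
        frame, using the sum of absolute differences (SAD) as the criterion."""
        target = curr[y:y + block, x:x + block].astype(np.int32)
        best, best_sad = (0, 0), np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                yy, xx = y + dy, x + dx
                if yy < 0 or xx < 0 or yy + block > prev.shape[0] or xx + block > prev.shape[1]:
                    continue                     # candidate block would fall outside the frame
                cand = prev[yy:yy + block, xx:xx + block].astype(np.int32)
                sad = np.abs(target - cand).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best                              # the motion vector of this block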

Motion compensation then uses the motion vectors to move each block from the previous frame to its new position, creating the prediction frame for the current frame. The difference between the current frame and this prediction (the prediction error) is what is subsequently coded and transmitted.
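
Continuing the sketch above, motion compensation copies each block of the previous frame to its displaced position, producing the prediction frame from which the prediction error is then computed (again a simplified illustration, not a standard-conformant implementation):

    import numpy as np

    def motion_compensate(prev, vectors, block=16):
        """Build the prediction frame: copy each block of the previous frame to
        its displaced position. `vectors[by, bx]` is the (dy, dx) vector of
        block (by, bx), e.g. as found by match_block above, so all displaced
        blocks are guaranteed to lie inside the frame."""
        pred = np.zeros_like(prev)
        for by in range(prev.shape[0] // block):
            for bx in range(prev.shape[1] // block):
                dy, dx = vectors[by, bx]
                y, x = by * block, bx * block
                pred[y:y + block, x:x + block] = prev[y + dy:y + dy + block, x + dx:x + dx + block]
        return pred
    # The error frame (current frame minus prediction) is what is transform coded and transmitted.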

Video compression techniques

Interlaced video

Standard (progressive, or non-interlaced) video consists of 25 (Europe) or 29.97 (USA) frames per second. Interlaced video consists of twice as many half-frames (fields) per second (50 in Europe, 59.94 in the USA). Each half-frame contains either the odd (top field) or the even (bottom field) lines of the full frame, and the fields are decoded and played back in alternating order.

Difference between progressive and interlaced video. A full frame is decomposed into top and bottom fields, creating half-frames which are placed alternately in the sequence; the video then has double the field rate.

Colour subsampling

Another coding technique is colour subsampling. Each pixel of a video frame encoded in the YUV colour space consists of three samples: one luminance (brightness) and two chrominance (colour) samples. For a square set of four pixels this is denoted 4:4:4. However, the human eye is more sensitive to changes in brightness than to changes in colour, which allows us to reduce the number of chrominance samples by half or even by three quarters. In the first case the four luminance samples are accompanied by four chrominance samples (4:2:2); in the second case the four luminance samples share only two chrominance samples (4:2:0 or 4:1:1).

Examples of colour subsampling. On the left, each luma sample has its own pair of chroma samples. In the middle, every two luma samples share one pair of chroma samples. On the right, four luma samples share one pair of chroma samples.
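
A sketch of 4:2:0 subsampling of one chroma plane in Python/numpy follows; plain 2x2 averaging is assumed here, whereas real codecs may use different filters and chroma sample positions:

    import numpy as np

    def subsample_420(chroma):
        """Reduce a full-resolution chroma plane to quarter size (4:2:0) by
        averaging each 2x2 block of samples."""
        h, w = chroma.shape
        c = chroma[:h - h % 2, :w - w % 2].astype(np.float32)
        return ((c[0::2, 0::2] + c[0::2, 1::2] + c[1::2, 0::2] + c[1::2, 1::2]) / 4.0).round().astype(np.uint8)

    # After 4:2:0 subsampling, each 2x2 block of pixels shares a single Cb/Cr pair
    # instead of every pixel carrying its own.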

Similarly to static images, video compression techniques are mainly hybrid methods combining time-domain (predictive) coding with transform coding, such as the discrete cosine transform (DCT) or the discrete wavelet transform (DWT). The prediction frame created in the process of motion compensation is subtracted from the current frame, creating the error frame. The error frame is block coded using the DCT, and the transform coefficients are quantized, scanned in "zig-zag" order and coded using variable length coding (VLC). The VLC sequence is then transmitted.

On the receiver side, the reconstruction proceeds in the inverse order: inverse VLC, inverse quantization and inverse DCT.
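
The transform stage for a single 8x8 block can be sketched as follows (Python with scipy; a uniform quantization step is assumed here, while real codecs use standard-specific quantization matrices and entropy coding of the scanned sequence):

    import numpy as np
    from scipy.fftpack import dct, idct

    def dct2(block):
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def idct2(coeffs):
        return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

    def zigzag_order(n=8):
        """Indices of an n x n block in zig-zag order (low frequencies first)."""
        return sorted(((y, x) for y in range(n) for x in range(n)),
                      key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))

    error_block = np.random.randint(-128, 128, (8, 8)).astype(np.float32)  # stand-in error-frame block
    q = 16                                                                 # illustrative quantization step
    coeffs = np.round(dct2(error_block) / q)                               # transform and quantize
    scanned = [int(coeffs[y, x]) for (y, x) in zigzag_order()]             # zig-zag scan into a 1-D sequence (input to VLC)
    recon = idct2(coeffs * q)                                              # receiver side: dequantize and inverse DCT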

The described approach is used with modifications in both MPEG and H.26x coding standards.

MPEG

The Moving Picture Experts Group (MPEG) was tasked by the ISO and IEC organizations with creating standards for audio and video compression. As a result, the following standards have been proposed, each focusing on a different application:

MPEG-1 focuses on interactive systems based on CD-ROM media. MPEG-2 extends the capabilities of MPEG-1 for digital television and High Definition TV (HDTV). MPEG-4 focuses on multimedia applications with very low bitrates.

MPEG-1

The standard was created to encode video with sufficient quality at a bitrate of 1.4 Mbit/s. It supports fast forward and backward playback and image freeze. While the typical frame size is 352x288 px (CIF), the codec supports frame sizes up to 720x576 px at 30 fps and 1.86 Mbit/s.

Coding in MPEG-1 combines DCT-based intra-frame coding with motion-compensated inter-frame prediction. There are three types of macroblocks (sets of four luminance blocks and two chrominance blocks), leading to three types of frames:

The I frame is coded purely by intra-frame means: its 8x8 blocks are DCT coded, the coefficients are quantized, scanned in "zig-zag" order and finally entropy coded. No motion estimation is performed, so the frame is essentially a standalone image, independent of other frames, and serves as an access point during fast forward or backward playback.

The P frame is coded with inter-frame prediction relative to the previous I or P frame. Motion vectors describe the displacement of its 16x16 pixel macroblocks, and the resulting error frame is DCT coded, quantized and entropy coded similarly to the I frame. However, since P frames depend on previous frames they do not contain the complete image information on their own; they serve as reference frames for further prediction, but not as access points for fast forward or backward playback.

B frames are obtained using bidirectional (forward and backward) prediction from surrounding I or P frames. They typically fill the sequence between I and P frames and add detail in fast scenes, while no other frames are predicted from them.

The ordering of I, P and B frames and their dependencies

The frame types may be combined flexibly to fulfil the application requirements. An IIIIIIIIII sequence offers fast random access and fast forward and backward playback; however, the low compression of I frames places high demands on bandwidth. Typically an IBBPBBPBBPBB(I) sequence, called a Group of Pictures (GOP), is used, inserting an I frame approximately every half a second.
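
The dependency structure of such a GOP can be illustrated with a small sketch (display order only; the transmission order of frames is not considered here):

    gop = list("IBBPBBPBBPBB")                              # a typical Group of Pictures in display order
    anchors = [i for i, t in enumerate(gop) if t in "IP"]   # frames that others may be predicted from

    for i, t in enumerate(gop):
        if t == "P":
            prev_anchor = max(a for a in anchors if a < i)
            print(f"frame {i} (P): predicted from frame {prev_anchor}")
        elif t == "B":
            prev_anchor = max(a for a in anchors if a < i)
            later = [a for a in anchors if a > i]
            next_anchor = min(later) if later else "the next GOP's I frame"
            print(f"frame {i} (B): predicted from frame {prev_anchor} and {next_anchor}")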

MPEG-2

MPEG-2 is a backward-compatible extension of the previous standard, adding support for interlaced video, higher maximum image resolution, TV video quality at bitrates of 4 to 8 Mbit/s and HDTV video quality at 20 Mbit/s.

Various decoder capabilities are specified through profiles and levels. Each level defines a set of parameters specifying the target application of the video, while each profile specifies the complexity of the algorithms used. The following tables describe the levels and profiles in detail.

List of MPEG-2 levels

Level       Parameters
HIGH        1920x1152 px, 60 fps, 80 Mbit/s
HIGH 1440   1440x1152 px, 60 fps, 60 Mbit/s
MAIN        720x576 px, 30 fps, 15 Mbit/s
LOW         352x288 px, 30 fps, 4 Mbit/s

List of MPEG-2 profiles

Profile            Algorithms
High               All Spatial Scalable profile functions plus 3-layer spatial scaling and SNR scaling modes; colour model YUV 4:2:2 for demanding tasks
Spatial Scalable   All SNR Scalable profile functions plus a 2-layer spatial scaling coding mode; colour model YUV 4:2:0
SNR Scalable       All Main profile functions plus a 2-layer SNR (signal-to-noise ratio) scaling coding mode; colour model YUV 4:2:0
Main               No scaling; interlaced video coding; random frame access; prediction mode with B frames; colour model YUV 4:2:0
Simple             All Main profile functions except the prediction mode with B frames; colour model YUV 4:2:0

Video scaling

Scaling enables decoders to play back a low-bitrate version of a video in case they are not able to play back the high-bitrate version. The decoder receives the low-quality video plus additional information that allows the quality to be scaled up.

Using SNR scaling, the DCT coefficients are first quantized coarsely for the low-bitrate video. The difference between the coarsely quantized values and the true values is then quantized again with a finer step and sent separately, enabling on-demand quality upscaling. With spatial scaling, the video is first encoded at a lower resolution and additional data are added to enable decoding at a higher resolution; if the decoding device does not support the higher resolution, it simply omits the additional data and decodes only the low-resolution video. Temporal scaling works in a similar way: a video with reduced frame rate is created and additional data are added to allow reconstruction of the higher frame rate. Spatial and temporal scaling can be combined to support variability in video coding, e.g. to serve both HDTV and SDTV systems.
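
The SNR scaling idea can be sketched on a single block of transform coefficients (the quantization steps below are illustrative): the base layer carries coarsely quantized coefficients, the enhancement layer carries the finely quantized difference, and a decoder may stop after the base layer or add the enhancement for higher quality.

    import numpy as np

    coeffs = np.random.randn(8, 8) * 50            # stand-in DCT coefficients of one block
    q_base, q_enh = 32, 4                           # coarse and fine quantization steps (illustrative)

    base = np.round(coeffs / q_base)                # base layer: coarsely quantized
    base_recon = base * q_base
    enh = np.round((coeffs - base_recon) / q_enh)   # enhancement layer: finely quantized residual

    low_quality = base_recon                        # decoded from the base layer alone
    high_quality = base_recon + enh * q_enh         # decoded when the enhancement layer is added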

Note that the MPEG-2 standard was developed jointly by ISO and ITU-T; the latter names it H.262.

MPEG-4

The MPEG-4 standard was developed to support very low bitrates of up to 64 kbit/s. Its goal was to enable video over the Internet and in mobile devices and networks, and to support interactivity with objects in the scene. This required improved compression methods, which now utilize video object coding for both natural (standard) and synthetic (wire-frame object) video.

There are two versions of the MPEG-4 video standard. The first is referred to as Part 2 and is used by a variety of codecs including DivX, XviD, Nero Digital and others. The second is referred to as Part 10, also known as MPEG-4 AVC/H.264 (Advanced Video Coding), and is used in x264, QuickTime and HD video media such as Blu-ray Disc.

Natural video coding is performed by detecting and coding video object planes (VOPs). Each VOP contains information about the shape and texture of an object in the scene. A sequence of VOPs representing the same object is called a video object (VO). Each video object can then be coded at a different bitrate, allowing flexible bitrate allocation and additional object manipulation (scaling, rotation, brightness and colour variation).

Demonstration of the usage of video objects

A video object is described by its shape, given by either a binary or a grayscale shape mask. Motion coding is based on principles similar to those used in the earlier MPEG standards but applied to the video object planes, creating I-VOP, P-VOP or B-VOP frames. Spatial redundancy is removed using the DCT, while temporal redundancy is handled by motion compensation. The texture of a video object is coded using a modification of the DCT, the shape-adaptive (SA) DCT. Alternatively, coding using the SA-DWT (shape-adaptive discrete wavelet transform) is permitted.
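
The binary shape mask idea can be sketched as follows (the texture is processed only where the mask marks the object; the SA-DCT itself is more involved and not reproduced here):

    import numpy as np

    frame = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # stand-in texture plane
    mask = np.zeros((64, 64), dtype=bool)
    mask[16:48, 20:44] = True                                     # binary shape mask of one video object

    object_pixels = frame[mask]   # only these samples belong to the VOP and are texture coded
    print(object_pixels.size, "of", frame.size, "pixels belong to this video object plane")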

Synthetic video allows artificial objects to be created and mixed with existing video objects in the scene. Its main target is to enable facial animation for multimedia applications.

H.261 and H.263 standards

The H.261 standard was published in 1990 to enable video telephony and video conferencing at low bitrates of 64 to 1920 kbit/s with low delay. It unifies television standards that use different line counts and field frequencies (PAL and SECAM with 625 lines at 50 Hz, NTSC with 525 lines at 60 Hz). As a basis it uses the CIF (352x288) and QCIF (176x144) video formats, the former intended for video conferences with more participants and the latter for video telephony, which usually transfers the head and shoulders of one person.

Both CIF and QCIF frames are split into groups of blocks (GOBs): CIF into GOBs 1-12 and QCIF into GOBs 1, 3 and 5. Each GOB is then split into 33 macroblocks, each containing six blocks of 8x8 pixels: four luminance blocks and two chrominance blocks (CB, CR).
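
These numbers are consistent with each other, as a small sanity-check sketch shows:

    # CIF luminance plane: 352 x 288 pixels, macroblocks of 16 x 16 luma pixels.
    mb_cols, mb_rows = 352 // 16, 288 // 16                   # 22 x 18 macroblocks
    total_mb = mb_cols * mb_rows                              # 396 macroblocks in total
    gobs = total_mb // 33                                     # 12 GOBs of 33 macroblocks each
    blocks_per_mb = 4 + 2                                     # 4 luminance blocks + 2 chrominance blocks, each 8x8
    print(mb_cols, mb_rows, total_mb, gobs, blocks_per_mb)    # 22 18 396 12 6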

The H.261 codec uses only I frames (called keyframes) and P frames, which are obtained by motion prediction from the previous I or P frame. The standard does not use B frames.

The coding algorithm uses hybrid block coding involving inter-frame prediction with motion compensation and DCT-based transform coding, quite similar to MPEG-1 coding. After spatial and temporal redundancy has been removed, each block is DCT transformed, quantized, scanned in zig-zag order and losslessly coded using Huffman coding. Additionally, a loop filter is used to smooth out the differences between blocks of the predicted image, improving the inter-frame prediction.

The loop filter works on the reconstructed frames inside the prediction loop and removes the blocking artefacts introduced by the block-wise DCT coding. Its task is to smooth out the hard edges between the blocks of the frame. Even though the filtering adds processing time at both the encoder and the decoder, the smoother reference frames improve the prediction, so smaller prediction errors (and motion vectors) need to be encoded.
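
A toy deblocking sketch follows (simple blending of the two pixel columns on either side of each vertical 8x8 block boundary; real loop filters are more elaborate and adaptive):

    import numpy as np

    def deblock_vertical_edges(frame, block=8):
        """Soften vertical block boundaries by blending the two pixel columns
        on either side of every 8x8 block edge (a crude stand-in for a real
        loop filter)."""
        out = frame.astype(np.float32)
        for x in range(block, frame.shape[1], block):
            left, right = out[:, x - 1].copy(), out[:, x].copy()
            out[:, x - 1] = 0.75 * left + 0.25 * right
            out[:, x] = 0.25 * left + 0.75 * right
        return np.clip(np.round(out), 0, 255).astype(np.uint8)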

The H.263 standard provides more efficient coding than H.261. Thanks to adopting some techniques from MPEG-1 it reduces the bitrate by up to 50% while maintaining the same subjective quality. Compared with H.261, H.263 brings support for more video formats (SQCIF, 4CIF, 16CIF), improved motion vector estimation, modified VLC coding and the introduction of PB-frames.

Motion estimation in H.263 works with half-pel (half-pixel) precision. While H.261 motion vectors are restricted to integer values, vectors in H.263 are represented with 0.5-pixel precision. Moreover, the motion vector of a macroblock is predicted from the motion vectors of the neighbouring macroblocks (as the median of their components), and only the difference between the predicted and the actual motion vector is transmitted; this is called median prediction.
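
The median prediction of motion vectors can be sketched as follows (vectors in half-pel units; the left, above and above-right neighbours are used here as the predictor set):

    import numpy as np

    def mv_difference(mv, left, above, above_right):
        """Predict each motion-vector component as the median of the three
        neighbouring macroblocks' vectors and return only the difference
        that would be transmitted. Vectors are (dy, dx) in half-pel units."""
        pred_y = np.median([left[0], above[0], above_right[0]])
        pred_x = np.median([left[1], above[1], above_right[1]])
        return (mv[0] - pred_y, mv[1] - pred_x)

    # The current vector differs only slightly from its neighbours, so the
    # transmitted difference is small (here exactly zero).
    print(mv_difference((3.5, -1.0), (3.0, -1.0), (3.5, -0.5), (4.0, -1.0)))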

The PB-frame mode works in a way similar to the MPEG-1 codec: the P-frame part is obtained from the previous I- or P-frame, and the B-frame part is obtained by bidirectional prediction from the surrounding frames. The difference from MPEG-1 is that in H.263 the B-frame is coded together with the P-frame as a single unit, the PB-frame, which is beneficial for low-bitrate video.

An extension of the H.263 standard, H.263+, adds robustness against transmission errors, dynamic scene resolution and frame scaling.

MPEG-4 AVC/H.264

The other (and newer) version of MPEG-4, known as Advanced Video Coding, is the most widely used standard today. It is maintained jointly by the ISO and ITU-T organizations and is particularly suitable for high-definition video compression.

The standard brings many improvements over the previous standards of both organizations, such as higher-resolution colour information, scalable video coding, and multi-view video coding, which enables coding of multiple camera angles and allows for stereoscopic (3D) video.

Variable block sizes allow precise segmentation of moving regions, with block sizes ranging from 16x16 down to 4x4 pixels. Multiple motion vectors can be derived from one macroblock, pointing to different reference pictures. The motion compensation algorithm works with quarter-pixel precision (compared to the half-pixel precision of H.263), enabling higher motion vector accuracy. The DCT is replaced by an exactly specified integer approximation, so that decoding produces identical results on all decoders. Additionally, a secondary Hadamard transform can be applied in smooth regions to further improve the compression ratio.
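
The secondary Hadamard transform can be sketched with a generic 4x4 Hadamard matrix applied to stand-in DC coefficients of a smooth region; the exact coefficient ordering and scaling used in H.264 differ in detail:

    import numpy as np

    # A 4x4 Hadamard matrix: entries are only +1 and -1, so the transform needs
    # no multiplications, which is why it is cheap to apply as a secondary stage.
    H = np.array([[1,  1,  1,  1],
                  [1,  1, -1, -1],
                  [1, -1, -1,  1],
                  [1, -1,  1, -1]])

    dc_block = np.array([[120, 118, 121, 119],
                         [119, 120, 120, 118],
                         [121, 119, 118, 120],
                         [120, 121, 119, 118]])   # stand-in DC coefficients of a smooth region

    forward = H @ dc_block @ H.T                  # separable 2-D Hadamard transform
    inverse = (H.T @ forward @ H) // 16           # H @ H.T = 4*I, so the round trip scales by 16
    assert (inverse == dc_block).all()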

Moreover, lossless macroblock coding is introduced, allowing a perfect representation of specific areas of the image; it works in two modes, the PCM macroblock and the enhanced lossless macroblock (with greater efficiency). Entropy coding introduces new algorithms, Context-Adaptive Binary Arithmetic Coding (CABAC) and Context-Adaptive Variable-Length Coding (CAVLC), which encode syntax elements and quantized transform coefficients more effectively than in previous standards.
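
The context-adaptive idea behind both algorithms can be sketched without a full arithmetic coder: each context keeps its own running probability estimate, and the cost of a binary symbol is roughly -log2 of the probability the model assigned to it (a highly simplified model; the real CABAC state machine and binarization are considerably more involved):

    import math
    from collections import defaultdict

    counts = defaultdict(lambda: [1, 1])   # per-context counts of zeros and ones (Laplace smoothing)

    def code_bin(context, bit):
        """Return the approximate cost in bits of coding `bit` under `context`,
        then update that context's probability estimate."""
        zeros, ones = counts[context]
        p = (ones if bit else zeros) / (zeros + ones)
        counts[context][1 if bit else 0] += 1
        return -math.log2(p)

    # Symbols that are highly predictable within their context cost well under one bit each.
    bits = sum(code_bin("coeff_is_zero", b) for b in [0, 0, 0, 1, 0, 0, 0, 0])
    print(round(bits, 2), "bits for 8 binary symbols")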

There are plenty of other improvements, which together maintain the same subjective quality as the older standards at half or even less of the bitrate, especially for high-bitrate, high-resolution video.

Similarly to MPEG-2, the MPEG-4 AVC/H.264 standard defines coding profiles to be used in different target applications and levels defining the required decoder performance.

WebM

WebM is an open audio and video format by Google intended for use with HTML5 video. It is a multimedia container based on Matroska that contains Vorbis-coded audio and VP8-coded video.

The VP8 codec was developed by On2 Technologies and released under an open source license by Google, which bought the company in 2010. Even though the VP8 codec uses many of the techniques introduced by the MPEG and H.26x standards, it brings further improvements aimed at keeping high subjective quality while reducing computational complexity. Some of the improvements follow.

The codec uses a so-called constructed reference frame, which serves as a reference for motion compensation of more than one predicted frame. The content of the constructed reference frame is not prescribed by the specification and is left to the encoder designers. The loop filtering procedure, which removes the blocking artefacts left by the block-based transform, can be adjusted on a per-block basis. The entropy coding uses mostly binary arithmetic coding which adapts to each frame individually.

Apart from the coding standards mentioned above, there are plenty of other video formats, for example Ogg Theora, which is based on the older VP3 codec by On2 Technologies, Windows Media Video (WMV) by Microsoft, and many more formats not covered by the ISO or ITU-T organizations.