How MPEG-4 Speech Compression Works Using CELP

MPEG-4 uses Code-Excited Linear Prediction (CELP) to compress human speech efficiently at low bitrates, roughly 4 kbps to 24 kbps. This article explains how CELP models the human vocal tract, how it uses codebooks to transmit speech parameters instead of raw audio waveforms, and how perceptual weighting preserves voice quality while sharply reducing data usage.

The Source-Filter Model of Speech

At the core of CELP is the source-filter model, which mimics how the human body produces speech. In humans, the lungs and vocal cords act as the excitation “source” (producing either periodic air pulses for voiced sounds like vowels, or turbulent noise for unvoiced sounds like “sh”). The throat, mouth, and nasal cavity act as a “filter” that shapes this sound.

MPEG-4 CELP replicates this process digitally:

* The Filter (Vocal Tract): Represented by Linear Predictive Coding (LPC) coefficients, which describe the shape of the vocal tract.
* The Source (Excitation): Represented by vectors selected from digital codebooks.
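As a rough illustration of the source-filter idea, the sketch below drives an all-pole filter with a pulse train. The function name lpc_synthesize, the sign convention for the coefficients, and the two coefficient values are assumptions made for this example; none of them come from the MPEG-4 standard.

```python
def lpc_synthesize(excitation, lpc):
    """All-pole synthesis: s[n] = e[n] + sum_k lpc[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


# A periodic pulse train stands in for the voiced "source".
pulses = [1.0 if n % 8 == 0 else 0.0 for n in range(16)]

# Hypothetical 2nd-order coefficients, chosen only to give a stable resonance.
coeffs = [1.2, -0.8]

speech = lpc_synthesize(pulses, coeffs)
print([round(s, 3) for s in speech[:4]])
```

Each pulse makes the filter ring, which is a crude analogue of the vocal tract shaping a glottal pulse. A real coder would use a higher filter order and coefficients estimated from the input speech.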

Analysis-by-Synthesis (AbS)

Rather than computing the excitation parameters directly from the input speech in a single open-loop pass, the CELP encoder selects them with a closed-loop search technique called Analysis-by-Synthesis (AbS).

Inside the encoder, a replica of the decoder synthesizes various trial speech signals using different excitation vectors. The encoder compares these synthesized signals against the original input speech. It then selects the excitation vector that produces the closest match to the original audio.
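The search loop described above can be sketched as follows. This is a minimal exhaustive search under stated assumptions: the names synthesize and abs_search, the tiny three-entry codebook, the first-order filter, and the least-squares gain are all illustrative choices, and production encoders use far larger codebooks with faster search strategies.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter (the encoder's local copy of the decoder)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def abs_search(target, codebook, lpc):
    """Try every codebook vector and keep the one with the lowest squared error."""
    best_index, best_gain, best_err = -1, 0.0, float("inf")
    for i, vec in enumerate(codebook):
        synth = synthesize(vec, lpc)
        energy = sum(y * y for y in synth)
        if energy == 0.0:
            continue
        # Least-squares gain for this candidate (an assumption of this sketch).
        gain = sum(t * y for t, y in zip(target, synth)) / energy
        err = sum((t - gain * y) ** 2 for t, y in zip(target, synth))
        if err < best_err:
            best_index, best_gain, best_err = i, gain, err
    return best_index, best_gain, best_err


# Hypothetical codebook and filter, for illustration only.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
lpc = [0.9]

# Build a target that entry 2 can reproduce exactly with gain 0.5.
target = [0.5 * y for y in synthesize(codebook[2], lpc)]

index, gain, err = abs_search(target, codebook, lpc)
print(index, round(gain, 3))
```

Because the target was constructed from entry 2, the search recovers that index with a gain close to 0.5. With real speech no entry matches exactly, and the encoder settles for the smallest error.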

The Dual-Codebook System

To represent the excitation signal efficiently at ultra-low bitrates, MPEG-4 CELP uses two types of codebooks:

* Adaptive Codebook: This codebook models the pitch and long-term periodicity of human speech, which is essential for voiced sounds. It is rebuilt continuously from past excitation signals, so its contents track the speaker's current pitch.
* Fixed Codebook: This codebook supplies a fixed set of sparse pulse patterns (MPEG-4 CELP defines multi-pulse and regular-pulse excitation modes for this purpose). It models the part of the excitation that the adaptive codebook cannot predict, such as unvoiced sounds and rapid transitions.

Instead of transmitting the actual waveform, the encoder only needs to transmit the indices (or addresses) of the best-matching vectors in these codebooks, along with their gain factors.
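To make the two contributions concrete, here is a minimal sketch of how one subframe of excitation could be assembled from a pitch lag, a fixed vector, and two gains. The function names, the lag of 4 samples, and the gain values are hypothetical and chosen for readability, not taken from the standard.

```python
def adaptive_vector(past_excitation, lag, length):
    """Repeat the excitation from `lag` samples ago to fill one subframe."""
    segment = past_excitation[len(past_excitation) - lag:]
    return [segment[n % lag] for n in range(length)]


def build_excitation(past_excitation, lag, adaptive_gain, fixed_vec, fixed_gain):
    """Excitation = scaled pitch repetition + scaled fixed-codebook pulses."""
    adaptive = adaptive_vector(past_excitation, lag, len(fixed_vec))
    return [adaptive_gain * a + fixed_gain * c for a, c in zip(adaptive, fixed_vec)]


# Past excitation with a pulse every 4 samples (a very short "pitch period").
past = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# Hypothetical fixed-codebook entry: a single pulse at position 1.
fixed_vec = [0.0, 1.0, 0.0, 0.0]

excitation = build_excitation(past, lag=4, adaptive_gain=0.8,
                              fixed_vec=fixed_vec, fixed_gain=0.3)

# The new excitation becomes part of the history used for the next subframe.
past = past + excitation
print(excitation)
```

The decoder can run the same two functions, which is why only the lag (adaptive index), the fixed index, and the gains need to be sent.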

Perceptual Weighting Filter

To optimize the limited bitrate, CELP applies a perceptual weighting filter during the Analysis-by-Synthesis search. This filter is based on human auditory masking properties.

The human ear cannot easily detect distortion or noise at frequencies where the speech signal itself is strongest. The perceptual weighting filter therefore de-emphasizes the error near the spectral peaks (formants) during the search, so that more of the quantization noise ends up under those peaks and less in the quieter valleys between them. This hides compression artifacts beneath the loudest parts of the voice, so the audio still sounds clear to the human ear despite heavy data compression.
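A form of weighting filter often used in CELP coders is W(z) = A(z/g1) / A(z/g2), where A(z) = 1 - sum_k a[k] z^-k is the LPC inverse filter and 0 < g2 < g1 <= 1. The sketch below implements that form as an assumption; the function names and the default values 0.9 and 0.6 are illustrative and are not quoted from the MPEG-4 specification.

```python
def bandwidth_expand(lpc, gamma):
    """Coefficients of A(z/gamma): the k-th coefficient is scaled by gamma**k."""
    return [a * gamma ** (k + 1) for k, a in enumerate(lpc)]


def perceptual_weight(signal, lpc, gamma_num=0.9, gamma_den=0.6):
    """Apply W(z) = A(z/gamma_num) / A(z/gamma_den) to a signal."""
    num = bandwidth_expand(lpc, gamma_num)
    den = bandwidth_expand(lpc, gamma_den)
    out = []
    for n, x in enumerate(signal):
        acc = x
        for k in range(len(lpc)):
            if n - 1 - k >= 0:
                acc -= num[k] * signal[n - 1 - k]  # FIR part, A(z/gamma_num)
                acc += den[k] * out[n - 1 - k]     # IIR part, 1 / A(z/gamma_den)
        out.append(acc)
    return out


# Hypothetical first-order predictor and a unit impulse as the "error" signal.
weighted = perceptual_weight([1.0, 0.0, 0.0, 0.0], [1.0])
print([round(w, 3) for w in weighted])
```

In an encoder, the difference between the original and each synthesized candidate would pass through this filter before its energy is measured. The filter attenuates the error near the formants, so errors there count for less in the search and the chosen excitation leaves more of its noise where the speech can mask it.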

What is Transmitted

Because CELP reduces speech to a small set of model parameters, the compressed bitstream is very compact. The MPEG-4 CELP encoder only transmits:

* LPC Coefficients: Updated every 10–30 milliseconds to define the vocal tract filter.
* Adaptive Codebook Index and Gain: Defining the pitch of the voice.
* Fixed Codebook Index and Gain: Defining the noise/excitation details.
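A little arithmetic shows why this parameter set is so compact. The bit allocation below is entirely hypothetical (it is not a table from the MPEG-4 standard); it only illustrates how per-frame bits translate into a bitrate.

```python
def bits_per_frame(lpc_bits, subframes, adaptive_index_bits,
                   fixed_index_bits, gain_bits):
    """LPC bits are sent once per frame; excitation bits once per subframe."""
    per_subframe = adaptive_index_bits + fixed_index_bits + gain_bits
    return lpc_bits + subframes * per_subframe


def bitrate_bps(frame_bits, frame_ms):
    """Convert bits per frame into bits per second."""
    return frame_bits * 1000 / frame_ms


# Hypothetical layout: 20 ms frame split into 4 subframes.
frame_bits = bits_per_frame(lpc_bits=24, subframes=4, adaptive_index_bits=8,
                            fixed_index_bits=20, gain_bits=6)
celp_bps = bitrate_bps(frame_bits, frame_ms=20)

# Uncompressed reference: 8 kHz sampling at 16 bits per sample.
pcm_bps = 8000 * 16

print(frame_bits, celp_bps, pcm_bps / celp_bps)  # 160 8000.0 16.0
```

Under these assumed numbers, 160 bits per 20 ms frame gives 8 kbps, one sixteenth of the 128 kbps needed for 16-bit PCM at 8 kHz.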

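At the receiving end, the MPEG-4 decoder looks up the same codebook vectors, scales them by the transmitted gains, and passes the resulting excitation through the LPC filter to reconstruct the speech. The result is highly intelligible voice communication over very low-bitrate channels.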