How VP9 Alternate Reference Frames Work in libvpx

This article explores the inner workings of the automatic alternate reference (alt-ref) frame feature in the libvpx VP9 encoder. We examine how the feature improves compression efficiency by synthesizing invisible reference frames from future source frames, how temporal filtering works, and how the encoder uses these frames to improve prediction accuracy without requiring the decoder to reorder frames for display.

What is an Alternate Reference Frame?

In traditional video coding standards like H.264, bi-directional prediction is achieved using B-frames. A B-frame can reference frames that come both before and after it in display order, and it is itself displayed. VP9 takes a different approach: it has no B-frame type and instead relies on Alternate Reference (alt-ref) frames, a concept inherited from VP8.

An alt-ref frame is a frame that is decoded and stored in the reference buffer but is never actually displayed to the user. Its sole purpose is to serve as a high-quality predictor for other frames in the Group of Pictures (GOP). Because it is invisible, the encoder can modify, denoise, or synthesize this frame to maximize its utility as a reference, without worrying about visual artifacts that would be distracting if the frame were directly displayed.

The Under-the-Hood Process

The automatic alt-ref mechanism in libvpx operates through a series of specialized steps during the encoding pipeline:

1. Lookahead and Multi-Pass Analysis

For automatic alt-ref selection to work, libvpx needs a lookahead buffer of future source frames, which is controlled by the lag-in-frames setting. The feature is typically used together with two-pass encoding, where first-pass statistics inform the decision. The encoder analyzes a window of upcoming frames to determine the boundaries of a GOP, then picks a future frame, often the last frame of that group, to serve as the source for the alt-ref frame.
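As a rough illustration of the settings involved, the following Python sketch assembles a vpxenc command line with lookahead and automatic alt-ref enabled. The option names are vpxenc's; the helper name build_vpxenc_args, the file names, and the specific values (25 lagged frames, two passes) are assumptions chosen for illustration, not recommended settings.

```python
def build_vpxenc_args(src, dst, lag_in_frames=25, passes=2):
    """Return a vpxenc argument list with automatic alt-ref enabled.

    Hypothetical helper: only the option names come from vpxenc; the
    default values here are illustrative.
    """
    if lag_in_frames <= 0:
        # Without a lookahead there are no future frames to build an alt-ref from.
        raise ValueError("automatic alt-ref needs a lookahead (lag_in_frames > 0)")
    return [
        "vpxenc",
        "--codec=vp9",
        f"--passes={passes}",
        f"--lag-in-frames={lag_in_frames}",
        "--auto-alt-ref=1",
        "-o", dst,
        src,
    ]


args = build_vpxenc_args("input.y4m", "output.webm")
print(" ".join(args))
```

The list form can be passed to subprocess.run as-is, which avoids shell quoting issues with file names.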

2. Temporal Filtering (Denoising)

Once a target frame is selected to become an alt-ref frame, libvpx applies a process called temporal filtering.

* The encoder looks at a group of frames surrounding the target frame (both preceding and succeeding it).
* It performs motion estimation to align the blocks of these neighboring frames with the target frame.
* A weighted average of the aligned blocks is calculated. This averaging dramatically reduces high-frequency temporal noise while preserving static or consistently moving details.
* The resulting synthetic frame is cleaner and easier to compress than any single raw frame from the source video.
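The weighted-average step can be sketched in a few lines of Python. This is a simplified model, not libvpx's actual filter: it assumes the neighboring frames have already been motion-aligned to the target, works on plain lists of integer pixels, and uses a made-up weighting rule in which a neighbor pixel loses weight as its squared difference from the target pixel grows. The function name temporal_filter and the strength parameter are illustrative.

```python
def temporal_filter(target, neighbors, strength=2):
    """Blend already-aligned neighbor frames into the target frame.

    Frames are lists of rows of integer pixel values. Each neighbor pixel
    gets a weight from 0 to 16 that shrinks as it diverges from the target
    pixel, so mismatched content is left out of the average.
    """
    out = []
    for y, target_row in enumerate(target):
        row = []
        for x, center in enumerate(target_row):
            acc = 16 * center   # the target pixel always has full weight
            total = 16
            for frame in neighbors:
                diff = frame[y][x] - center
                penalty = (diff * diff * 3) >> strength
                weight = max(0, 16 - penalty)
                acc += weight * frame[y][x]
                total += weight
            row.append((acc + total // 2) // total)  # rounded weighted mean
        out.append(row)
    return out


target = [[104, 100]]
neighbors = [[[100, 200]], [[102, 100]]]
filtered = temporal_filter(target, neighbors)
print(filtered)
```

In this example the first pixel is pulled toward its slightly different neighbors, while the value 200 in the second column differs so much from the target that it receives zero weight and does not contaminate the result.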

3. Encoding as an Invisible Frame

The synthesized alt-ref frame is encoded into the bitstream with the frame header flag show_frame set to 0. The decoder decodes the pixel data and stores it in its reference buffer pool, where later frames can select it as their ALTREF reference. (VP9 keeps a pool of eight reference slots, and each inter frame picks up to three of them as its LAST, GOLDEN, and ALTREF references.) The decoder then skips the display step, so the viewer never sees this synthesized, temporally filtered frame directly.
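The decoder-side behavior can be modeled with a toy loop. This sketch is not a real VP9 decoder: CodedFrame, its fields, and decode_stream are hypothetical names, the "picture" is just a label, and each frame refreshes exactly one slot, whereas a real bitstream signals which slots to refresh with a bitmask.

```python
from dataclasses import dataclass


@dataclass
class CodedFrame:
    name: str          # label standing in for the coded picture
    show_frame: int    # 1 = display after decoding, 0 = keep hidden
    refresh_slot: int  # reference slot to overwrite (simplified to one slot)


def decode_stream(frames, num_slots=8):
    """Return (displayed pictures, final reference slots) for a coded sequence."""
    slots = [None] * num_slots
    displayed = []
    for frame in frames:
        picture = frame.name                 # stand-in for the decoded pixels
        slots[frame.refresh_slot] = picture  # always stored as a reference
        if frame.show_frame:                 # only shown frames reach the screen
            displayed.append(picture)
    return displayed, slots


stream = [
    CodedFrame("key", 1, 0),
    CodedFrame("altref", 0, 2),
    CodedFrame("inter1", 1, 0),
    CodedFrame("inter2", 1, 0),
]
displayed, slots = decode_stream(stream)
print(displayed)
print(slots[2])
```

The hidden frame never appears in the displayed list, yet it remains available in its slot for the frames that follow.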

4. Bi-directional Prediction

The frames that sit between the previous reference and the alt-ref in display order can now predict from both directions: from earlier frames (forward prediction) and from the alt-ref frame, which lies in their future (backward prediction). VP9 can also combine two references within a single block through compound prediction, which averages the two predictors. Because the alt-ref frame is a denoised version of a future frame, the residual data for the intermediate frames tends to be smaller, which lowers the bitrate needed to reach a given quality.
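A small numeric sketch shows why having a future reference helps. It assumes one-dimensional blocks that are already motion-compensated, a contrived brightness ramp as the content, and a compound predictor formed as the rounded average of two predictions; the helper names sad and compound are illustrative.

```python
def sad(block, prediction):
    """Sum of absolute differences, a simple measure of residual size."""
    return sum(abs(a - b) for a, b in zip(block, prediction))


def compound(pred_a, pred_b):
    """Rounded average of two predictors."""
    return [(a + b + 1) >> 1 for a, b in zip(pred_a, pred_b)]


past = [10, 20, 30, 40]      # block from an earlier frame
future = [30, 40, 50, 60]    # co-located block from the alt-ref frame
current = [20, 30, 40, 50]   # block being coded, halfway through a fade

sad_past = sad(current, past)
sad_future = sad(current, future)
sad_both = sad(current, compound(past, future))
print(sad_past, sad_future, sad_both)
```

Either reference alone leaves a residual on every pixel, while the average of the two matches the current block in this contrived case. Real content is rarely this tidy, but fades and gradual lighting changes benefit in the same way.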

5. The Overlay Frame

When the encoder reaches the source frame that was used to build the alt-ref, it produces an overlay frame. In libvpx this is usually an ordinary inter frame predicted from the alt-ref buffer. Because the alt-ref already holds a filtered version of the same picture, most blocks need little or no residual, and the frame costs very few bits. (VP9's show_existing_frame header flag is a related but separate mechanism that displays a stored buffer directly, without coding any new picture data.) The overlay frame completes the GOP cycle.
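Putting the steps together, the coding order for one GOP can be sketched as follows. This assumes the simplest arrangement, a single alt-ref per GOP built from the last source frame of the group; real encoders may use more elaborate structures. The function name gop_coding_order and the tuple layout are illustrative.

```python
def gop_coding_order(first, last):
    """List (kind, source_index, show_frame) tuples in coding order.

    Covers source frames first..last inclusive, with one hidden alt-ref
    built from the last source frame of the group.
    """
    order = [("altref", last, 0)]                        # coded first, never shown
    order += [("inter", i, 1) for i in range(first, last)]
    order.append(("overlay", last, 1))                   # shows the alt-ref's picture slot
    return order


order = gop_coding_order(1, 4)
for entry in order:
    print(entry)

# Source indices of the frames that are actually displayed, in coding order.
shown = [src for _, src, show in order if show]
```

The displayed frames come out in their natural order even though the alt-ref was coded ahead of them, which is why the decoder does not have to reorder its output.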