How VP9 Alternate Reference Frames Work in libvpx
This article explores the inner workings of the automatic alternate
reference frame (alt-ref) feature in the libvpx-vp9 video
codec. We will examine how this feature improves compression efficiency
by synthesizing invisible reference frames from future source frames,
the mechanics of temporal filtering, and how the encoder uses these
frames to improve prediction accuracy without requiring the decoder to
reorder frames for display. The cost is paid on the encoder side, which
must buffer upcoming source frames before it can emit output.
What is an Alternate Reference Frame?
In traditional video coding standards like H.264, highly efficient bi-directional prediction is achieved using B-frames. B-frames reference both past and future frames in display order, which means they are coded out of display order and the decoder has to reorder pictures before showing them. VP9, like VP8 before it, approaches this differently by using Alternate Reference (alt-ref) frames.
An alt-ref frame is a frame that is decoded and stored in the reference buffer but is never actually displayed to the user. Its sole purpose is to serve as a high-quality predictor for other frames in the Group of Pictures (GOP). Because it is invisible, the encoder can modify, denoise, or synthesize this frame to maximize its utility as a reference, without worrying about visual artifacts that would be distracting if the frame were directly displayed.
The Under-the-Hood Process
The automatic alt-ref mechanism in libvpx operates
through a series of specialized steps during the encoding pipeline:
1. Lookahead and Multi-Pass Analysis
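For automatic alt-ref selection to work, libvpx relies
on a lookahead buffer of queued source frames. Its depth is set by the
lag-in-frames option, and the feature is most commonly used with
two-pass encoding, where first-pass statistics guide the decisions. The
encoder analyzes the window of upcoming frames to determine the
boundaries of a GOP (a golden-frame group in libvpx terms). It then
picks a future frame, often the last one in the group, to serve as the
source template for the alt-ref frame.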
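As a rough illustration of how these settings are usually switched on, the sketch below assembles a vpxenc command line in Python. The flags shown (--passes, --auto-alt-ref, --lag-in-frames, --arnr-maxframes, --arnr-strength) are vpxenc options; the helper name, the file names, and the chosen values are assumptions made for this example, not recommended settings.

```python
import shlex


def build_vpxenc_command(src, dst, lag_in_frames=25,
                         arnr_maxframes=7, arnr_strength=5):
    """Build an argument list for a two-pass VP9 encode with auto alt-ref.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # vpxenc buffers at most 25 lagged frames; a lag of 0 would leave the
    # encoder with no future frames to build an alt-ref from.
    if not 1 <= lag_in_frames <= 25:
        raise ValueError("lag_in_frames must be between 1 and 25")
    return [
        "vpxenc",
        "--codec=vp9",
        "--passes=2",                          # first pass gathers statistics
        "--auto-alt-ref=1",                    # let the encoder place alt-refs
        f"--lag-in-frames={lag_in_frames}",    # depth of the lookahead buffer
        f"--arnr-maxframes={arnr_maxframes}",  # frames used by the temporal filter
        f"--arnr-strength={arnr_strength}",    # temporal filter strength
        "-o", dst,
        src,
    ]


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run to execute it.
    print(shlex.join(build_vpxenc_command("input.y4m", "output.webm")))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later handed to subprocess.run.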
2. Temporal Filtering (Denoising)
Once a target frame is selected to become an alt-ref frame,
libvpx applies a process called temporal filtering:
* The encoder looks at a group of frames surrounding the target frame (both preceding and succeeding it).
* It performs motion estimation to align the blocks of these neighboring frames with the target frame.
* A weighted average of the aligned blocks is calculated. This averaging dramatically reduces high-frequency temporal noise while preserving static or consistently moving details.
* The resulting synthetic frame is cleaner and easier to compress than any single raw frame from the source video.
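The weighted-average step can be sketched in a few lines of Python. This is a simplified model, not libvpx's actual filter: it assumes the neighboring frames have already been motion-aligned to the target, treats each frame as a flat list of integer luma samples, and uses a made-up weighting rule in which a neighbor sample loses weight as its squared difference from the target grows. The function name and the strength parameter are hypothetical.

```python
MAX_WEIGHT = 16  # weight given to a perfectly matching sample (assumed scale)


def temporal_filter(target, aligned_neighbors, strength=3):
    """Blend motion-aligned neighbor frames into the target frame.

    Each frame is a flat list of integer samples of equal length.
    """
    # The target frame always contributes with full weight.
    accum = [MAX_WEIGHT * sample for sample in target]
    total = [MAX_WEIGHT] * len(target)

    for frame in aligned_neighbors:
        for i, (t, n) in enumerate(zip(target, frame)):
            diff = t - n
            # Larger mismatches are penalised more; a higher strength
            # tolerates bigger differences before the weight drops to zero.
            penalty = (3 * diff * diff) >> strength
            weight = max(0, MAX_WEIGHT - penalty)
            accum[i] += weight * n
            total[i] += weight

    # Rounded integer division gives the filtered sample.
    return [(a + w // 2) // w for a, w in zip(accum, total)]


if __name__ == "__main__":
    noisy_target = [104, 100, 100, 200]
    neighbors = [[100, 102, 100, 50], [100, 98, 100, 200]]
    print(temporal_filter(noisy_target, neighbors))
```

In the example data, the first sample is pulled toward its neighbors because they agree closely, while the badly mismatched value 50 in the last position receives zero weight and leaves the target sample untouched. That is the property that lets the filter remove noise without smearing content that failed to align.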
3. Encoding as an Invisible Frame
The synthesized alt-ref frame is encoded into the bitstream.
Crucially, the encoder sets the header flag show_frame to
0. When the decoder processes this frame, it decodes the
pixel data and stores it in one of the eight VP9 reference frame
buffer slots, which later frames address through the ALTREF reference.
However, the decoder bypasses the display step, meaning the viewer
never sees this synthesized, temporally filtered frame directly.
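A toy model makes the decoder-side behavior concrete. The sketch below is not a real decoder: pictures are just strings, and the class and field names are invented for illustration. It only models the two facts described above, namely that a decoded frame may overwrite reference slots, and that it reaches the display only when show_frame is set.

```python
from dataclasses import dataclass


@dataclass
class CodedFrame:
    name: str
    show_frame: bool
    refresh_slots: tuple = ()  # reference slots this frame overwrites


class ToyDecoder:
    NUM_REF_SLOTS = 8  # VP9 keeps eight reference buffer slots

    def __init__(self):
        self.ref_slots = [None] * self.NUM_REF_SLOTS
        self.displayed = []

    def decode(self, frame):
        picture = f"pixels({frame.name})"  # stand-in for real decoding
        for slot in frame.refresh_slots:
            self.ref_slots[slot] = picture
        # A hidden frame is decoded and stored, but never output.
        if frame.show_frame:
            self.displayed.append(picture)


if __name__ == "__main__":
    decoder = ToyDecoder()
    decoder.decode(CodedFrame("key", show_frame=True, refresh_slots=(0, 1, 2)))
    decoder.decode(CodedFrame("altref", show_frame=False, refresh_slots=(2,)))
    decoder.decode(CodedFrame("inter1", show_frame=True, refresh_slots=(0,)))
    print("displayed:", decoder.displayed)
    print("slot 2 holds:", decoder.ref_slots[2])
```

The slot numbers used here are arbitrary; a real encoder signals which slots to refresh, and which slots act as LAST, GOLDEN, and ALTREF, in each frame header.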
4. Bi-directional Prediction
Subsequent frames in the GOP can now predict from this high-quality, denoised alt-ref frame, which lies in their future (backward prediction), while also predicting from previously coded past frames (forward prediction). VP9's compound prediction mode can additionally average one predictor from each direction within a single block. Because the alt-ref frame is a lower-noise view of a future state of the video, intermediate frames tend to need less residual data, which can noticeably reduce the bitrate required for a given quality.
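To see why having a predictor on each side of a frame helps, consider a block whose brightness changes steadily over time. The sketch below compares the residual left by predicting from the past frame alone, from the future (alt-ref) frame alone, and from a rounded average of the two, which is how a compound predictor combines its two sources. The sample values and function names are made up for the example, and motion compensation is left out entirely.

```python
def sad(block_a, block_b):
    """Sum of absolute differences, a simple measure of residual size."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def compound_average(pred_a, pred_b):
    """Rounded per-sample average of two predictors."""
    return [(a + b + 1) >> 1 for a, b in zip(pred_a, pred_b)]


if __name__ == "__main__":
    past = [10, 20, 30, 40]      # block from an earlier frame
    future = [30, 40, 50, 60]    # co-located block from the alt-ref
    current = [20, 30, 40, 50]   # block being coded, halfway through a fade

    print("past only:  ", sad(current, past))
    print("future only:", sad(current, future))
    print("compound:   ", sad(current, compound_average(past, future)))
```

A linear fade is the best case for averaging; on real content the gain depends on how well each predictor matches after motion compensation, and the encoder chooses per block between single and compound prediction.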
5. The Overlay Frame
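When the video timeline reaches the display presentation time of the frame that was used to build the alt-ref, the encoder produces an overlay frame. This is an ordinary shown inter frame that predicts almost entirely from the picture held in the alt-ref buffer, so most of its blocks need little or no residual and it usually costs very few bits. It should not be confused with the separate show_existing_frame header flag in the VP9 bitstream, which tells the decoder to output an already stored reference frame without decoding any new picture data. Either way, the alt-ref has already been decoded and stored, so displaying its content is cheap, and the GOP cycle completes efficiently.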
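Putting the steps together, the coding order of one group can be laid out as a short list. The sketch below is a simplified single alt-ref model with invented labels; real libvpx groups can be more elaborate, for example with several alt-ref layers. Source index 0 is taken to be the golden or key frame that closed the previous group, so it is not repeated here.

```python
def coding_order(gop_length):
    """Return (label, source_index, show_frame) tuples in coding order."""
    # The hidden alt-ref is built from the last source frame of the group
    # and is coded first.
    order = [("ARF", gop_length, False)]
    # Intermediate frames follow in display order and are all shown.
    order += [("INTER", i, True) for i in range(1, gop_length)]
    # The overlay is shown at the alt-ref's source position.
    order.append(("OVERLAY", gop_length, True))
    return order


def display_order(coded_frames):
    """Source indices of the frames the viewer sees, in output order."""
    return [src for _, src, shown in coded_frames if shown]


if __name__ == "__main__":
    group = coding_order(4)
    for label, src, shown in group:
        print(f"{label:8s} source={src} show_frame={int(shown)}")
    print("displayed:", display_order(group))
```

The displayed indices come out strictly increasing even though the alt-ref was coded first, which is why the decoder never needs to reorder its output.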