VP9 Temporal Scalability in libvpx

This article explains how temporal scalability works in the libvpx-vp9 encoder and how it lets a receiver or relay lower the video framerate on the fly. You will learn about the hierarchical structure of temporal layers, how restrictions on frame references prevent decoding errors when frames are discarded, and how developers configure libvpx to use this technique for adaptive video streaming.

Understanding Temporal Layers

Temporal scalability works by dividing a single video stream into multiple hierarchical layers, categorized as a base layer and one or more enhancement layers. Each layer represents a fraction of the target framerate. For example, in a 30 frames per second (fps) video with three temporal layers:

Layer 0 (Base Layer): Encodes the video at 7.5 fps.
Layer 1 (Enhancement Layer 1): Adds another 7.5 fps, bringing the combined rate to 15 fps.
Layer 2 (Enhancement Layer 2): Adds 15 fps, bringing the total combined rate to 30 fps.

A decoder can subscribe to only Layer 0, Layers 0 and 1, or all three layers, depending on available CPU and network bandwidth.
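The layer split above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a dyadic pattern, where each added layer doubles the framerate; the function names are illustrative and are not part of libvpx.

```python
from typing import List


def temporal_layer(frame_index: int, num_layers: int) -> int:
    """Layer of a frame in a dyadic pattern (0, 2, 1, 2, ... for three layers)."""
    period = 1 << (num_layers - 1)  # frames per repetition of the pattern
    pos = frame_index % period
    if pos == 0:
        return 0  # base layer
    # Count trailing zero bits of the position: more zeros means a lower layer.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return num_layers - 1 - trailing_zeros


def cumulative_fps(full_fps: float, num_layers: int) -> List[float]:
    """Framerate obtained by decoding layers 0..k, for each k."""
    return [full_fps / (1 << (num_layers - 1 - k)) for k in range(num_layers)]
```

For a 30 fps stream with three layers, cumulative_fps(30, 3) gives 7.5, 15 and 30 fps, and the first eight frames fall into layers 0, 2, 1, 2, 0, 2, 1, 2.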

Reference Frame Restrictions

The mechanism that allows a receiver to drop frames dynamically without corrupting the video stream is strict reference frame management. In standard video encoding, frames rely on past or future frames for compression (inter-frame prediction). With temporal scalability, libvpx enforces rules on which frames can reference each other:

Lower layers never reference higher layers. A frame in Layer 0 can only use previous frames in Layer 0 as references.
Enhancement layers can only reference their own layer or lower layers. A frame in Layer 2 can reference frames in Layer 1 or Layer 0, but never vice versa.

Because of these rules, if a network router or client drops all packets belonging to Layer 2, the remaining Layer 0 and Layer 1 packets can still be fully decoded, since none of them depends on a Layer 2 frame. The video continues to play at 15 fps instead of 30 fps, and the drop itself introduces no blocky artifacts or decoding failures.
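The two rules can be checked mechanically. The sketch below builds a hypothetical three-layer reference structure (one reference per frame, chosen for illustration and not taken from libvpx), verifies that no frame references a higher layer, and confirms that the stream stays decodable after the upper layers are removed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Frame:
    index: int
    tid: int                 # temporal layer ID
    refs: Tuple[int, ...]    # indices of the frames this frame predicts from


def build_pattern(num_frames: int) -> List[Frame]:
    """Hypothetical 0-2-1-2 structure; every frame references one earlier frame."""
    tids = [0, 2, 1, 2]
    frames = []
    for i in range(num_frames):
        pos = i % 4
        if i == 0:
            refs = ()            # key frame, no references
        elif pos == 0:
            refs = (i - 4,)      # Layer 0 -> previous Layer 0 frame
        elif pos == 2:
            refs = (i - 2,)      # Layer 1 -> preceding Layer 0 frame
        else:
            refs = (i - 1,)      # Layer 2 -> preceding lower-layer frame
        frames.append(Frame(i, tids[pos], refs))
    return frames


def obeys_layer_rules(frames: List[Frame]) -> bool:
    """True if no frame references a frame from a higher layer."""
    tid_of = {f.index: f.tid for f in frames}
    return all(tid_of[r] <= f.tid for f in frames for r in f.refs)


def decodable_after_drop(frames: List[Frame], max_tid: int) -> bool:
    """True if every kept frame still has all of its references."""
    kept = [f for f in frames if f.tid <= max_tid]
    kept_ids = {f.index for f in kept}
    return all(r in kept_ids for f in kept for r in f.refs)
```

If a Layer 0 frame were made to reference a Layer 2 frame, obeys_layer_rules would return False and dropping Layer 2 would leave that frame without its reference.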

Dynamic Framerate Dropping in Practice

In live video scenarios, such as WebRTC, dynamic framerate dropping is managed by a Selective Forwarding Unit (SFU) or the client application:

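Tagging: When encoding, each compressed frame is associated with a Temporal Layer ID (TID). In WebRTC, the sender carries the TID in the RTP payload descriptor, so an SFU can read it without parsing the video bitstream.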
Monitoring: The SFU monitors network conditions (packet loss, round-trip time, bandwidth).
Dropping: If congestion occurs, the SFU stops forwarding packets marked with the highest TID (e.g., TID 2). The receiving client sees a lower framerate (e.g., 15 fps instead of 30 fps), but the remaining frames decode normally, so per-frame image quality is unchanged and the drop itself does not cause a freeze.
Recovery: When bandwidth recovers, the SFU resumes forwarding TID 2 packets, and the stream dynamically scales back up to 30 fps.
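The forwarding decision can be sketched as follows. This is an illustrative model and not code from any real SFU: the Packet type, the function names, and the rule of picking the highest layer whose cumulative bitrate fits the bandwidth estimate are all assumptions made for the example.

```python
from typing import List, NamedTuple, Sequence


class Packet(NamedTuple):
    seq: int   # sequence number
    tid: int   # temporal layer ID read from the packet metadata


def highest_tid_for(bandwidth_kbps: int, cumulative_kbps: Sequence[int]) -> int:
    """Highest layer whose cumulative bitrate fits; the base layer is always kept."""
    best = 0
    for tid, needed in enumerate(cumulative_kbps):
        if needed <= bandwidth_kbps:
            best = tid
    return best


def forward(packets: Sequence[Packet], max_tid: int) -> List[Packet]:
    """Keep only the packets that belong to layers 0..max_tid."""
    return [p for p in packets if p.tid <= max_tid]
```

With cumulative layer bitrates of 200, 350 and 600 kbps, an estimate of 400 kbps selects TID 1, so only Layer 0 and Layer 1 packets are forwarded. The sketch ignores one detail: a real SFU would typically wait for a suitable switch-up point before it resumes a higher layer.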

Libvpx Configuration

To enable this behavior in the libvpx-vp9 encoder, developers configure specific temporal scalability parameters in the vpx_codec_enc_cfg_t structure:

ts_number_layers: Defines the total number of temporal layers (up to 5).
ts_target_bitrate: Sets the target bitrate for each layer, in kbps. The entries are cumulative, so the value for a layer covers that layer plus all lower layers.
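ts_rate_decimator: Determines the framerate division factor for each layer. A layer's cumulative framerate is the full framerate divided by its entry (for example 4, 2, 1 for three layers).
ts_layer_id: Defines a repeating template that maps each frame position to a temporal layer (for example 0, 2, 1, 2). The length of the template is set by the companion field ts_periodicity.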
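To make the relationship between these fields concrete, the sketch below computes a consistent set of values for one to three layers. It is written in Python for illustration only and returns a plain dictionary keyed by the vpx_codec_enc_cfg_t field names; it does not call libvpx. The helper name, the percentage split between layers, and the cumulative reading of ts_target_bitrate are assumptions of this example.

```python
from typing import Dict, List

# (ts_periodicity, ts_layer_id, ts_rate_decimator) for dyadic patterns.
TEMPLATES = {
    1: (1, [0], [1]),
    2: (2, [0, 1], [2, 1]),
    3: (4, [0, 2, 1, 2], [4, 2, 1]),
}


def temporal_config(num_layers: int, total_kbps: int,
                    share_percent: List[int]) -> Dict[str, object]:
    """Hypothetical helper: derive ts_* values from a per-layer bitrate split."""
    if num_layers not in TEMPLATES:
        raise ValueError("this sketch only covers 1 to 3 layers")
    if len(share_percent) != num_layers or sum(share_percent) != 100:
        raise ValueError("need one share per layer, summing to 100")
    periodicity, layer_id, decimator = TEMPLATES[num_layers]
    cumulative, running = [], 0
    for pct in share_percent:
        running += pct  # assumed cumulative: layer k includes layers 0..k
        cumulative.append(total_kbps * running // 100)
    return {
        "ts_number_layers": num_layers,
        "ts_periodicity": periodicity,
        "ts_layer_id": list(layer_id),
        "ts_rate_decimator": list(decimator),
        "ts_target_bitrate": cumulative,
    }
```

For example, temporal_config(3, 600, [40, 25, 35]) yields target bitrates of 240, 390 and 600 kbps, a layer template of 0, 2, 1, 2 and decimators of 4, 2, 1. In a C application these values would be copied into the corresponding fields before the encoder is initialized.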

With these controls, libvpx lets applications adapt quickly to fluctuating network conditions: a relay or client lowers the framerate simply by discarding packets, without the CPU cost of re-encoding the video stream.