What Internal Data Structures Does libaom Use for Video Frames?

The libaom library, the open-source reference encoder and decoder for the AV1 video codec developed by the Alliance for Open Media, uses a layered set of internal data structures to hold and process video frame data during compression and decompression. Understanding these structures helps developers who want to optimize video processing pipelines, contribute to the codec, or integrate AV1 encoding into their software. This article covers the primary structures in the libaom source code: aom_image_t, YV12_BUFFER_CONFIG, the encoder and shared frame state (AV1_COMP and AV1_COMMON), the per-block structures MACROBLOCK and MACROBLOCKD, and the reference-counted buffer pool.

The External and API Layer: aom_image_t

At the public API layer, libaom interacts with external applications using the aom_image_t structure. This structure is defined in aom/aom_image.h and serves as the primary wrapper for passing raw input frames into the encoder or receiving decoded frames from the decoder.

Key fields within aom_image_t include:

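w and h: The stored width and height of the image, which may be rounded up for alignment and can therefore exceed the displayed size.
d_w and d_h: The displayed width and height, i.e., the region of the stored image that is actually shown.
planes: An array of three pointers (Y, U, and V), each pointing to the start of that plane’s pixel data.
stride: An array giving, for each plane, the number of bytes between the starts of consecutive rows. This can exceed the row’s pixel data because of alignment padding.
fmt: The image format, which identifies the plane layout and chroma subsampling (e.g., AOM_IMG_FMT_I420) and whether samples are stored in 16-bit words (e.g., AOM_IMG_FMT_I42016, used for 10- and 12-bit content). The actual bit depth is carried separately in the bit_depth field.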
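
The role of stride is easiest to see in a loop that walks one plane row by row. The following is a minimal sketch, not libaom code: demo_image is a hypothetical, cut-down stand-in that borrows only the field names discussed above, and demo_sum_luma is an illustrative helper that assumes 8-bit samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields discussed above.
 * It is NOT the real aom_image_t; only the field names are borrowed. */
typedef struct {
  unsigned int d_w, d_h;    /* displayed size in pixels */
  unsigned char *planes[3]; /* Y, U, V plane base pointers */
  int stride[3];            /* bytes between the starts of consecutive rows */
} demo_image;

/* Sum the displayed 8-bit luma samples, skipping any per-row padding. */
uint64_t demo_sum_luma(const demo_image *img) {
  assert(img != NULL && img->planes[0] != NULL);
  assert(img->stride[0] >= (int)img->d_w);
  uint64_t sum = 0;
  for (unsigned int y = 0; y < img->d_h; ++y) {
    /* Advance by stride, not by d_w, to reach the next row. */
    const unsigned char *row =
        img->planes[0] + (size_t)y * (size_t)img->stride[0];
    for (unsigned int x = 0; x < img->d_w; ++x) sum += row[x];
  }
  return sum;
}
```

Indexing rows with d_w instead of stride is a common integration bug: it works on tightly packed buffers and silently reads padding bytes on aligned ones.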

The Internal Frame Buffer: YV12_BUFFER_CONFIG

While aom_image_t is the interface for the outside world, libaom’s internal core operates heavily on YV12_BUFFER_CONFIG. This structure, defined in aom_scale/yv12config.h, manages the actual allocated memory buffers used for reference frames, motion estimation, and filtering.

Unlike a plain image container, YV12_BUFFER_CONFIG allocates extra padding around the actual frame boundaries. This padding, often referred to as “border pixels,” is filled by extending the edge pixels outward, so motion compensation can fetch pixels slightly outside the frame without reading out-of-bounds memory. The structure tracks the cropped (visible) width and height, the aligned width and height, the border size, the strides, and the pointers into the allocated memory for the luma (Y) and chroma (U/V) components.
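
The border idea can be sketched with a single plane. The code below is an illustration under stated assumptions, not the libaom implementation: demo_plane and its functions are hypothetical names, the layout is simplified to one 8-bit plane, and the border is filled by plain edge replication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-plane buffer with a border on every side.
 * Loosely modelled on the idea behind YV12_BUFFER_CONFIG, not its layout. */
typedef struct {
  int width, height; /* visible size */
  int border;        /* padding pixels on each side */
  int stride;        /* width + 2 * border */
  uint8_t *alloc;    /* start of the allocation (top-left of the border) */
  uint8_t *origin;   /* pixel (0, 0) of the visible area */
} demo_plane;

int demo_plane_alloc(demo_plane *p, int width, int height, int border) {
  assert(p != NULL && width > 0 && height > 0 && border >= 0);
  p->width = width;
  p->height = height;
  p->border = border;
  p->stride = width + 2 * border;
  p->alloc = calloc((size_t)p->stride * (size_t)(height + 2 * border), 1);
  if (p->alloc == NULL) return -1;
  p->origin = p->alloc + (size_t)border * (size_t)p->stride + border;
  return 0;
}

/* Replicate edge pixels into the border so that reads slightly
 * outside the visible frame return the nearest edge value. */
void demo_plane_extend(demo_plane *p) {
  /* Left and right borders of every visible row. */
  for (int y = 0; y < p->height; ++y) {
    uint8_t *row = p->origin + (ptrdiff_t)y * p->stride;
    memset(row - p->border, row[0], (size_t)p->border);
    memset(row + p->width, row[p->width - 1], (size_t)p->border);
  }
  /* Copy the full first and last rows (borders included) up and down. */
  uint8_t *top = p->origin - p->border;
  uint8_t *bottom = top + (ptrdiff_t)(p->height - 1) * p->stride;
  for (int i = 1; i <= p->border; ++i) {
    memcpy(top - (ptrdiff_t)i * p->stride, top, (size_t)p->stride);
    memcpy(bottom + (ptrdiff_t)i * p->stride, bottom, (size_t)p->stride);
  }
}

/* Read a pixel; coordinates may be negative or past the edge,
 * as long as they stay inside the border. */
uint8_t demo_plane_at(const demo_plane *p, int x, int y) {
  assert(x >= -p->border && x < p->width + p->border);
  assert(y >= -p->border && y < p->height + p->border);
  return p->origin[(ptrdiff_t)y * p->stride + x];
}

void demo_plane_free(demo_plane *p) {
  free(p->alloc);
  p->alloc = NULL;
  p->origin = NULL;
}
```

Keeping a separate origin pointer means the inner loops of a motion search can use negative offsets directly, without a bounds check or clamp on every fetch.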

Frame Processing and Decision Structures: AV1_COMP

For the encoder specifically, the top-level state is maintained in a massive structure called AV1_COMP (defined in av1/encoder/encoder.h). Within this context, several structures manage how a frame is broken down for processing:

AV1_COMMON: Shared between the encoder and the decoder, this structure stores frame-level state such as the current frame type (key, inter, intra-only, or switch frame), quantization parameters (quant_params in recent versions of the source), loop filter settings, and segmentation data. Golden and AltRef are reference frame roles, not frame types.
MACROBLOCK and MACROBLOCKD: AV1 processes frames hierarchically. Each frame is divided into superblocks (64x64 or 128x128 pixels), which are recursively partitioned into smaller coding blocks. MACROBLOCK holds the encoder-only data used during the rate-distortion optimization (RDO) search, including source residuals and transform coefficients. MACROBLOCKD holds the per-block state needed to reconstruct the frame, such as destination buffers and syntax elements. It is used by both the encoding and decoding loops, and the encoder’s MACROBLOCK embeds one as its e_mbd member.
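
To give a sense of scale for the superblock grid, the arithmetic can be sketched as follows. The helper names are hypothetical and not part of libaom; the only facts assumed are that a superblock is 64x64 or 128x128 pixels and that partial superblocks at the right and bottom edges still count.

```c
#include <assert.h>

/* Hypothetical helper: superblocks needed to cover one dimension.
 * Rounds up, because a partial superblock at the edge still counts. */
int demo_sb_count(int pixels, int sb_size) {
  assert(pixels > 0);
  assert(sb_size == 64 || sb_size == 128);
  return (pixels + sb_size - 1) / sb_size;
}

/* Hypothetical helper: total superblocks in a frame. */
int demo_sb_total(int width, int height, int sb_size) {
  return demo_sb_count(width, sb_size) * demo_sb_count(height, sb_size);
}
```

For a 1920x1080 frame this gives a 15 by 9 grid (135 superblocks) at 128x128 and a 30 by 17 grid (510 superblocks) at 64x64. Each of those superblocks is the root of its own partition search in the encoder.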

Buffer Management: BufferPool and RefCntBuffer

Because AV1 prediction structures let a frame reference several other frames, including frames that come later in display order, libaom keeps frames in an internal buffer pool. The BufferPool structure manages an array of RefCntBuffer elements.

Each RefCntBuffer wraps a YV12_BUFFER_CONFIG along with a reference count. When a frame is designated as a reference frame (e.g., a Golden or AltRef frame), its reference count increments, and the count decrements again when that reference is dropped. The buffer is returned to the pool for reuse only once the count reaches zero, that is, when neither the encoder nor the decoder still needs it for temporal prediction. This avoids allocating and freeing frame memory repeatedly during playback or encoding.
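
The reference-counting scheme can be sketched in a few lines. This is a toy model under stated assumptions, not the libaom API: demo_pool, demo_ref_buf, and all functions here are hypothetical, the pool size is arbitrary, and an integer frame_id stands in for the wrapped frame buffer.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_POOL_SIZE 4 /* arbitrary size, for illustration only */

typedef struct {
  int ref_count; /* number of holders; 0 means the slot is free */
  int frame_id;  /* stand-in for the wrapped frame buffer */
} demo_ref_buf;

typedef struct {
  demo_ref_buf bufs[DEMO_POOL_SIZE];
} demo_pool;

void demo_pool_init(demo_pool *pool) {
  for (int i = 0; i < DEMO_POOL_SIZE; ++i) {
    pool->bufs[i].ref_count = 0;
    pool->bufs[i].frame_id = -1;
  }
}

/* Claim the first free slot; returns NULL if every slot is in use. */
demo_ref_buf *demo_pool_get(demo_pool *pool, int frame_id) {
  for (int i = 0; i < DEMO_POOL_SIZE; ++i) {
    if (pool->bufs[i].ref_count == 0) {
      pool->bufs[i].ref_count = 1;
      pool->bufs[i].frame_id = frame_id;
      return &pool->bufs[i];
    }
  }
  return NULL;
}

/* Take an additional reference, e.g. when a frame becomes a reference frame. */
void demo_ref(demo_ref_buf *buf) {
  assert(buf != NULL && buf->ref_count > 0);
  ++buf->ref_count;
}

/* Drop one reference; the slot becomes reusable when the count hits zero. */
void demo_unref(demo_ref_buf *buf) {
  assert(buf != NULL && buf->ref_count > 0);
  --buf->ref_count;
}

int demo_pool_free_slots(const demo_pool *pool) {
  int n = 0;
  for (int i = 0; i < DEMO_POOL_SIZE; ++i) {
    if (pool->bufs[i].ref_count == 0) ++n;
  }
  return n;
}
```

In this model a slot is never freed explicitly. It simply becomes eligible for reuse once its last holder calls demo_unref, which mirrors the behavior described above: memory is allocated once and recycled. A multithreaded implementation would also need a lock around the pool, which this sketch omits.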