What Internal Data Structures Does libaom Use for Video Frames?
The libaom library, the open-source reference encoder and decoder for
the AV1 video codec developed by the Alliance for Open Media, relies on
a sophisticated hierarchy of internal data structures to manage,
manipulate, and optimize video frame data during compression and
decompression. Understanding these structures is crucial for developers
looking to optimize video processing pipelines, contribute to the codec,
or integrate AV1 encoding into their software. This article explores the
primary structures within the libaom source code, including
aom_image_t, YV12_BUFFER_CONFIG, and the
macroblock/coding block representations that drive the encoder’s
decision-making process.
The External and API Layer: aom_image_t
At the public API layer, libaom interacts with external applications
using the aom_image_t structure. This structure is defined
in aom/aom_image.h and serves as the primary wrapper for
passing raw input frames into the encoder or receiving decoded frames
from the decoder.
Key fields within aom_image_t include:
- w and h: The stored width and height of the image, which may be rounded up from the visible size for alignment.
- d_w and d_h: The displayed width and height, that is, the dimensions of the picture that is actually shown.
- planes: An array of pointers (typically up to three for Y, U, and V) pointing to the actual pixel data.
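- stride: An array giving, for each plane, the number of bytes between the starts of consecutive rows, which may include padding for memory alignment.
- fmt: The image format, specifying the plane layout and chroma subsampling (e.g., AOM_IMG_FMT_I420). High-bit-depth variants such as AOM_IMG_FMT_I42016 store each sample in 16 bits and are used for 10/12-bit content; the bit depth itself is carried in the separate bit_depth field.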
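To make the relationship between planes, stride, and the displayed size concrete, here is a minimal sketch in C. It does not include the libaom headers: SketchImage is a hypothetical, trimmed-down stand-in that mirrors only a few aom_image_t fields, and the sketch_* helpers are illustrative names, not libaom functions. It assumes 8-bit samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a few aom_image_t fields (8-bit formats only). */
typedef struct {
  unsigned int d_w, d_h;        /* displayed width and height */
  unsigned int x_chroma_shift;  /* 1 for 4:2:0 and 4:2:2, 0 for 4:4:4 */
  unsigned int y_chroma_shift;  /* 1 for 4:2:0, 0 otherwise */
  uint8_t *planes[3];           /* Y, U, V */
  int stride[3];                /* bytes between consecutive row starts */
} SketchImage;

/* Width in samples of one plane; chroma rounds up for odd luma widths. */
static unsigned int sketch_plane_width(const SketchImage *img, int plane) {
  assert(plane >= 0 && plane < 3);
  if (plane == 0) return img->d_w;
  return (img->d_w + img->x_chroma_shift) >> img->x_chroma_shift;
}

/* Height in rows of one plane; chroma rounds up for odd luma heights. */
static unsigned int sketch_plane_height(const SketchImage *img, int plane) {
  assert(plane >= 0 && plane < 3);
  if (plane == 0) return img->d_h;
  return (img->d_h + img->y_chroma_shift) >> img->y_chroma_shift;
}

/* Sum the visible samples of a plane, stepping row to row by the stride
 * so that any padding bytes at the end of a row are skipped. */
static uint64_t sketch_plane_sum(const SketchImage *img, int plane) {
  const unsigned int w = sketch_plane_width(img, plane);
  const unsigned int h = sketch_plane_height(img, plane);
  uint64_t sum = 0;
  for (unsigned int y = 0; y < h; ++y) {
    const uint8_t *row =
        img->planes[plane] + (size_t)y * (size_t)img->stride[plane];
    for (unsigned int x = 0; x < w; ++x) sum += row[x];
  }
  return sum;
}
```

The key point is that rows are addressed as base + y * stride, never base + y * width, because the two differ whenever the buffer is padded.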
The Internal Frame Buffer: YV12_BUFFER_CONFIG
While aom_image_t is the interface for the outside
world, libaom’s internal core operates heavily on
YV12_BUFFER_CONFIG. This structure, defined in
aom_scale/yv12config.h, manages the actual allocated memory
buffers used for reference frames, motion estimation, and filtering.
Unlike standard image containers, YV12_BUFFER_CONFIG
includes extensive padding around the actual frame boundaries. This
padding, often referred to as “border pixels,” allows motion
compensation algorithms to fetch pixels outside the frame boundaries (by
clamping or extending the edge pixels) without triggering out-of-bounds
memory errors. For the luma (Y) and chroma (U/V) components, the structure
tracks the visible (cropped) width/height, the aligned width/height, the
strides, the border size, and the pointers into the underlying memory
allocation.
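The border idea can be sketched as follows. This is an illustration under stated assumptions, not libaom's implementation: SketchPlane and the sketch_* functions are hypothetical names, the plane is 8-bit, and the border is the same on all four sides. The sketch allocates a padded plane, replicates the edge samples into the border, and then reads a sample at coordinates outside the visible frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single padded plane (not the libaom structure). */
typedef struct {
  int width, height; /* visible frame size */
  int border;        /* padding on every side, in samples */
  int stride;        /* width + 2 * border */
  uint8_t *alloc;    /* start of the allocation */
  uint8_t *origin;   /* sample (0, 0) of the visible frame */
} SketchPlane;

/* Allocate a zeroed plane with a border; returns 0 on success, -1 on failure. */
static int sketch_plane_alloc(SketchPlane *p, int width, int height,
                              int border) {
  assert(width > 0 && height > 0 && border >= 0);
  p->width = width;
  p->height = height;
  p->border = border;
  p->stride = width + 2 * border;
  p->alloc = calloc((size_t)p->stride * (size_t)(height + 2 * border), 1);
  if (p->alloc == NULL) return -1;
  p->origin = p->alloc + (size_t)border * (size_t)p->stride + (size_t)border;
  return 0;
}

/* Replicate edge samples into the border on all four sides. */
static void sketch_extend_borders(SketchPlane *p) {
  /* Left and right: repeat the first and last sample of each visible row. */
  for (int y = 0; y < p->height; ++y) {
    uint8_t *row = p->origin + (ptrdiff_t)y * p->stride;
    memset(row - p->border, row[0], (size_t)p->border);
    memset(row + p->width, row[p->width - 1], (size_t)p->border);
  }
  /* Top and bottom: copy the already padded first and last rows outward. */
  uint8_t *top = p->origin - p->border;
  uint8_t *bottom = top + (ptrdiff_t)(p->height - 1) * p->stride;
  for (int i = 1; i <= p->border; ++i) {
    memcpy(top - (ptrdiff_t)i * p->stride, top, (size_t)p->stride);
    memcpy(bottom + (ptrdiff_t)i * p->stride, bottom, (size_t)p->stride);
  }
}

/* Read a sample; coordinates may lie inside the border region. */
static uint8_t sketch_fetch(const SketchPlane *p, int x, int y) {
  assert(x >= -p->border && x < p->width + p->border);
  assert(y >= -p->border && y < p->height + p->border);
  return p->origin[(ptrdiff_t)y * p->stride + x];
}
```

With the border filled in once per frame, a motion vector that points slightly outside the picture needs no per-sample bounds check in the inner loop, which is the trade of extra memory for simpler, faster prediction code.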
Frame Processing and Decision Structures: AV1_COMP
For the encoder specifically, the top-level state is maintained in a
massive structure called AV1_COMP (defined in
av1/encoder/encoder.h). Within this context, several
structures manage how a frame is broken down for processing:
- AV1_COMMON: Shared between the encoder and decoder, this structure stores frame-level state, such as the current frame type (key, inter, intra-only, or switch frame), quantization parameters (quant_params in recent versions), loop filter settings, and the segmentation map.
- Macroblock (MACROBLOCK and MACROBLOCKD): AV1 processes frames hierarchically using superblocks (64x64 or 128x128 pixels) that are recursively partitioned into smaller coding blocks.
  - MACROBLOCK stores the encoder-specific data used during the RDO (Rate-Distortion Optimization) search, including source residuals and transform coefficients.
  - MACROBLOCKD contains the destination data and syntax elements required to reconstruct the frame, acting as the state shared by both the encoding and decoding loops.
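As a small illustration of the superblock layer, the sketch below computes how many superblocks cover a frame and how much of an edge superblock is actually visible. The names SketchSbGrid, sketch_sb_grid, and sketch_sb_visible are hypothetical and do not appear in libaom; the only codec fact assumed is that a superblock is 64 or 128 samples square.

```c
#include <assert.h>

/* Hypothetical result type: superblock columns and rows covering a frame. */
typedef struct {
  int sb_cols;
  int sb_rows;
} SketchSbGrid;

/* Number of superblocks needed to cover the frame, rounding up so that a
 * partial superblock at the right or bottom edge is still counted. */
static SketchSbGrid sketch_sb_grid(int frame_w, int frame_h, int sb_size) {
  assert(frame_w > 0 && frame_h > 0);
  assert(sb_size == 64 || sb_size == 128);
  SketchSbGrid g;
  g.sb_cols = (frame_w + sb_size - 1) / sb_size;
  g.sb_rows = (frame_h + sb_size - 1) / sb_size;
  return g;
}

/* Visible extent, along one dimension, of the superblock at sb_index.
 * Interior superblocks return sb_size; an edge superblock may return less. */
static int sketch_sb_visible(int frame_dim, int sb_index, int sb_size) {
  const int start = sb_index * sb_size;
  assert(sb_index >= 0 && start < frame_dim);
  const int remaining = frame_dim - start;
  return remaining < sb_size ? remaining : sb_size;
}
```

For a 1920x1080 frame with 128x128 superblocks this gives a 15 by 9 grid, and the bottom row of superblocks has only 56 visible rows, which is why the partition search has to handle blocks that straddle the frame edge.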
Buffer Management: BufferPool and RefCntBuffer
Because AV1 uses complex prediction structures where frames can
reference multiple past and future frames, libaom utilizes an internal
frame buffer pool. The BufferPool structure manages an
array of RefCntBuffer elements.
Each RefCntBuffer wraps a
YV12_BUFFER_CONFIG along with reference counters. When a
frame is designated as a reference frame (e.g., a Golden or AltRef
frame), its reference count increments. The buffer is released back into
the pool for reuse only once its count drops to zero, meaning neither the
encoder nor the decoder still needs it for temporal prediction or output.
This avoids repeated allocation and deallocation of large frame buffers
during encoding or decoding.
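The reference-counting scheme can be sketched in a few lines of C. This is a simplified model, not the libaom code: SketchPool, SketchRefBuf, and the sketch_* functions are hypothetical names, the pool size of 4 is arbitrary, and the frame payload is reduced to an integer ID in place of a real frame buffer.

```c
#include <assert.h>

#define SKETCH_POOL_SIZE 4 /* arbitrary size chosen for this illustration */

/* Hypothetical pool entry: a counter plus a stand-in for the frame data. */
typedef struct {
  int ref_count; /* 0 means the buffer is free for reuse */
  int frame_id;  /* placeholder for the wrapped frame buffer */
} SketchRefBuf;

typedef struct {
  SketchRefBuf bufs[SKETCH_POOL_SIZE];
} SketchPool;

/* Find a free buffer, claim it with one reference, and return its index.
 * Returns -1 when every buffer is still referenced. */
static int sketch_acquire(SketchPool *pool, int frame_id) {
  for (int i = 0; i < SKETCH_POOL_SIZE; ++i) {
    if (pool->bufs[i].ref_count == 0) {
      pool->bufs[i].ref_count = 1;
      pool->bufs[i].frame_id = frame_id;
      return i;
    }
  }
  return -1;
}

/* Add a reference, e.g. when the frame is stored in a reference slot. */
static void sketch_ref(SketchPool *pool, int idx) {
  assert(idx >= 0 && idx < SKETCH_POOL_SIZE);
  assert(pool->bufs[idx].ref_count > 0);
  ++pool->bufs[idx].ref_count;
}

/* Drop a reference; the buffer becomes reusable when the count hits zero. */
static void sketch_unref(SketchPool *pool, int idx) {
  assert(idx >= 0 && idx < SKETCH_POOL_SIZE);
  assert(pool->bufs[idx].ref_count > 0);
  --pool->bufs[idx].ref_count;
}
```

A buffer that is both the current frame and held in a reference slot has a count of two; it only returns to the pool after both holders release it, so the memory is recycled without ever being freed and reallocated.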