How Does libaom Calculate SSIM Internally?
This article provides a technical overview of how libaom (the reference software encoder for the AV1 video format) calculates Structural Similarity (SSIM) internally during its rate control process. It covers the downsampling techniques, the patch-based luminance and contrast math, and the ways these metrics feed back into frame-level and block-level quantization decisions to improve visual quality per bit.
The Role of SSIM in AV1 Rate Control
In video encoding, rate control algorithms must decide how many bits to allocate to each frame or block. While traditional encoders rely heavily on Mean Squared Error (MSE), libaom integrates SSIM (Structural Similarity Index) and MS-SSIM (Multi-Scale SSIM) to better align bit distribution with human visual perception.
When the encoder operates in a tune-for-SSIM mode (e.g., --tune=ssim), the rate control module dynamically adjusts the quantization parameter (QP) based on the structural distortion it predicts or measures.
Step-by-Step Internal Calculation
libaom's internal SSIM calculation follows an optimized pipeline designed to keep the cost of a perceptual metric low inside the encoding loop.
1. Downsampling and Windowing
Standard SSIM uses a Gaussian window to weight local pixel statistics. To approximate this at lower cost, libaom computes statistics over uniformly weighted local pixel blocks (typically \(8 \times 8\) or \(16 \times 16\) patches). For Multi-Scale SSIM (MS-SSIM), the encoder iteratively downsamples both the reference and the distorted frame with a low-pass 2x2 average filter, then recomputes the metric at each coarser scale.
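The downsampling step can be illustrated with a minimal Python sketch. It is not libaom's code: the function names `downsample_2x2` and `build_pyramid`, the list-of-rows frame layout, and the rounding rule are all assumptions chosen for clarity.

```python
def downsample_2x2(frame):
    """Halve a frame (list of equal-length rows of ints) by averaging 2x2 blocks.

    Hypothetical helper for illustration; a trailing odd row or column is dropped.
    """
    height = len(frame) - len(frame) % 2
    width = len(frame[0]) - len(frame[0]) % 2
    out = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            total = (frame[r][c] + frame[r][c + 1] +
                     frame[r + 1][c] + frame[r + 1][c + 1])
            row.append((total + 2) >> 2)  # rounded integer average of 4 pixels
        out.append(row)
    return out


def build_pyramid(frame, levels):
    """Return up to `levels` scales, finest first, stopping if a frame gets too small."""
    pyramid = [frame]
    for _ in range(levels - 1):
        current = pyramid[-1]
        if len(current) < 2 or len(current[0]) < 2:
            break
        pyramid.append(downsample_2x2(current))
    return pyramid
```

In a multi-scale setup, the same pyramid would be built for both the source and the reconstructed frame, and the per-patch statistics would then be computed at every level.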
2. Local Statistical Accumulation
For any given local window, libaom calculates the essential statistical sums. If we define \(x\) as the original source patch and \(y\) as the reconstructed (distorted) patch, the encoder accumulates:
- \(\sum x\) and \(\sum y\) (used to derive the local means)
- \(\sum x^2\) and \(\sum y^2\) (used to derive the local variances)
- \(\sum xy\) (used to derive the local covariance)
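As a rough illustration of this accumulation, the sketch below gathers the five sums over one square window and then converts them to means, variances, and covariance. The names `accumulate_sums` and `moments_from_sums` and the list-of-rows frame layout are hypothetical, and the code is a plain scalar version rather than anything taken from libaom.

```python
def accumulate_sums(src, rec, top, left, size=8):
    """Accumulate the five raw sums over a size x size window.

    src and rec are frames stored as lists of rows of integer pixel values.
    """
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for r in range(top, top + size):
        for c in range(left, left + size):
            x = src[r][c]
            y = rec[r][c]
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_yy += y * y
            sum_xy += x * y
    return sum_x, sum_y, sum_xx, sum_yy, sum_xy


def moments_from_sums(sums, count):
    """Turn raw sums into (mean_x, mean_y, var_x, var_y, cov_xy)."""
    sum_x, sum_y, sum_xx, sum_yy, sum_xy = sums
    mean_x = sum_x / count
    mean_y = sum_y / count
    var_x = sum_xx / count - mean_x * mean_x   # E[x^2] - E[x]^2
    var_y = sum_yy / count - mean_y * mean_y
    cov_xy = sum_xy / count - mean_x * mean_y  # E[xy] - E[x]E[y]
    return mean_x, mean_y, var_x, var_y, cov_xy
```

Keeping only integer sums in the inner loop is what makes this stage a natural fit for vectorized code: the divisions happen once per window, not once per pixel.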
3. Applying the SSIM Formula
Using these accumulated sums, libaom evaluates the core SSIM formula internally using fixed-point arithmetic or optimized SIMD routines (AVX2/NEON) to speed up execution. The calculation uses the standard SSIM expression, which folds the luminance, contrast, and structure comparisons into a single ratio:
\[\text{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}\]
Where:
- \(\mu_x\) and \(\mu_y\) are the local means derived from \(\sum x\) and \(\sum y\).
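- \(\sigma_x^2\) and \(\sigma_y^2\) are the local variances derived from \(\sum x^2\) and \(\sum y^2\).
- \(\sigma_{xy}\) is the local covariance derived from \(\sum xy\).
- \(C_1\) and \(C_2\) are stabilization constants that scale with the bit depth (e.g., 8-bit vs. 10-bit video). In the standard SSIM definition, \(C_1 = (K_1 L)^2\) and \(C_2 = (K_2 L)^2\), where \(L\) is the maximum pixel value and \(K_1 = 0.01\), \(K_2 = 0.03\) are the usual defaults. They keep the ratio stable when the denominators approach zero in dark or flat areas.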
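The formula can be evaluated directly from the raw sums by multiplying the numerator and denominator through by \(N^2\), where \(N\) is the number of pixels in the window, so that no per-window means or variances need to be formed first. The Python sketch below shows this rearrangement. The function name `ssim_from_sums` is hypothetical, the constants use the textbook defaults \(K_1 = 0.01\) and \(K_2 = 0.03\) (an assumption, not a value read from libaom), and floating point stands in for whatever fixed-point scaling a production encoder would use.

```python
def ssim_from_sums(sum_x, sum_y, sum_xx, sum_yy, sum_xy, count, bit_depth=8):
    """SSIM for one window, computed from raw sums (illustrative sketch).

    count is the number of pixels in the window (e.g. 64 for 8x8).
    """
    peak = (1 << bit_depth) - 1          # L: 255 for 8-bit, 1023 for 10-bit
    c1 = (0.01 * peak) ** 2              # assumed textbook K1
    c2 = (0.03 * peak) ** 2              # assumed textbook K2
    n2 = count * count

    # Each factor below equals the matching factor of the SSIM formula times N^2:
    #   N^2 * (2*mu_x*mu_y + C1)         = 2*sum_x*sum_y + C1*N^2
    #   N^2 * (2*sigma_xy + C2)          = 2*(N*sum_xy - sum_x*sum_y) + C2*N^2
    #   N^2 * (mu_x^2 + mu_y^2 + C1)     = sum_x^2 + sum_y^2 + C1*N^2
    #   N^2 * (var_x + var_y + C2)       = N*sum_xx - sum_x^2 + N*sum_yy - sum_y^2 + C2*N^2
    num = (2 * sum_x * sum_y + c1 * n2) * \
          (2 * (count * sum_xy - sum_x * sum_y) + c2 * n2)
    den = (sum_x ** 2 + sum_y ** 2 + c1 * n2) * \
          (count * sum_xx - sum_x ** 2 + count * sum_yy - sum_y ** 2 + c2 * n2)
    return num / den
```

The common \(N^4\) factor cancels in the ratio, so the result matches the formula above. Identical patches score 1, and the score falls as the reconstruction diverges from the source.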
Integration into the Rate Control Loop
Once the SSIM values are calculated for local regions, libaom uses this information to guide its rate control decisions in two primary ways:
- Frame-Level Bit Allocation: The average SSIM of previously encoded frames helps the rate-distortion optimization (RDO) model predict how much perceived quality will drop if the bitrate is restricted. If a scene consists of highly complex textures where plain MSE over-penalizes distortion, the SSIM metric signals that viewers will tolerate more noise there, allowing the rate control loop to save bits.
- Cyclic Refresh and Adaptive Quantization: libaom uses the local variance statistics from the SSIM computation to distinguish “flat” areas (where artifacts are highly visible) from “textured” areas (where artifacts are masked). The rate control system lowers the QP (raising quality) for flat, structurally important blocks and raises the QP for structurally chaotic blocks where compression artifacts are naturally hidden.
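The block-level idea can be sketched as a mapping from per-block variance to a QP offset. This is a hypothetical illustration only: the function `qp_offsets_from_variance`, the log-variance comparison against the frame average, and the `strength` and `max_delta` parameters are assumptions made for the example and do not reproduce libaom's actual heuristics.

```python
import math


def qp_offsets_from_variance(variances, strength=4.0, max_delta=8):
    """Map per-block source variances to signed QP offsets (illustrative only).

    Blocks flatter than the frame average get a negative offset (finer
    quantization); busier blocks get a positive offset (coarser quantization).
    """
    # Work in the log domain so the mapping depends on variance ratios.
    log_vars = [math.log(1.0 + v) for v in variances]
    mean_log = sum(log_vars) / len(log_vars)

    offsets = []
    for lv in log_vars:
        delta = round(strength * (lv - mean_log))
        # Clamp so a single outlier block cannot swing the QP too far.
        offsets.append(max(-max_delta, min(max_delta, delta)))
    return offsets
```

With this mapping, a frame whose blocks all share the same variance gets zero offsets everywhere, so the adjustment only redistributes bits when the content is actually uneven.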