Structure of an Opus Audio Format Packet

This article provides a technical overview of the structure of an Opus packet. It breaks down the framing used by the Opus codec: the Table of Contents (TOC) byte, configuration numbers, frame count codes, and how individual audio frames are packed within a single packet.

An Opus packet is designed to be self-describing: a decoder can determine the packet’s configuration from the packet itself, without external metadata. In its standard form the packet does not carry its own total length, however; the transport or container (for example RTP or Ogg) is expected to supply it. Every Opus packet consists of a mandatory Table of Contents (TOC) byte, optional frame length indicators, and one or more audio frames containing the compressed payload.

1. The Table of Contents (TOC) Byte

The first byte of any Opus packet is the TOC byte. This single byte is critical because it defines how the rest of the packet must be parsed. The TOC byte is split into three fields, listed here from the most significant bit (bit 0) to the least significant bit (bit 7):

Configuration Number (Bits 0–4): The five most significant bits specify one of 32 possible configurations. Each configuration fixes the coding mode (SILK-only for voice, CELT-only for music and low latency, or Hybrid, which combines both), the audio bandwidth (narrowband, mediumband, wideband, super-wideband, or fullband), and the frame duration (2.5, 5, 10, 20, 40, or 60 ms).
Stereo Flag (Bit 5): The sixth bit (‘s’) indicates the channel configuration. A value of 0 means the packet contains mono audio, while a value of 1 indicates stereo audio.
Frame Count Code (Bits 6–7): The last two bits (‘c’) represent the Frame Count Code, which tells the decoder how many frames are packed into this single packet:
- 00 (Code 0): Exactly 1 frame.
- 01 (Code 1): Exactly 2 frames of equal compressed size.
- 10 (Code 2): Exactly 2 frames of different compressed sizes.
- 11 (Code 3): An arbitrary number of frames (from 1 to 48 frames, up to a maximum packet duration of 120 ms).
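
The field layout above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the names `Toc` and `parse_toc` are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class Toc(NamedTuple):
    config: int   # 5-bit configuration number, 0..31
    stereo: bool  # True when the 's' bit is 1
    code: int     # 2-bit frame count code, 0..3


def parse_toc(toc_byte: int) -> Toc:
    """Split a TOC byte into its three fields (hypothetical helper)."""
    if not 0 <= toc_byte <= 0xFF:
        raise ValueError("TOC must fit in one byte")
    config = toc_byte >> 3          # five most significant bits
    stereo = bool(toc_byte & 0x04)  # the single 's' bit
    code = toc_byte & 0x03          # two least significant bits
    return Toc(config, stereo, code)
```

For example, `parse_toc(0xFC)` yields configuration 31, stereo, frame count code 0.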

2. Frame Length Indicators

Depending on the Frame Count Code defined in the TOC byte, the packet may contain optional bytes to specify the length of the audio frames before the payload begins.

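For Code 0 and Code 1: No length indicators are needed. The frame data occupies the total packet size minus the 1-byte TOC; with Code 1, that remainder must be even and is split equally between the two frames.
For Code 2: A 1- or 2-byte length indicator follows the TOC byte and gives the size of the first frame. A first byte of 0 to 251 is the length itself; a first byte of 252 to 255 means a second byte follows, and the length is 4 × (second byte) + (first byte), up to 1275 bytes. The second frame occupies whatever remains: the total packet size minus the TOC byte, the length bytes, and the first frame.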
For Code 3: This code introduces a “Code 3 payload header” directly after the TOC byte. This header consists of:
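- A signaling byte (the frame count byte): The first (most significant) bit is the VBR (Variable Bitrate) flag (0 for CBR, 1 for VBR). The second bit is the padding flag. The remaining 6 bits give the number of frames in the packet, which must not be zero.
- Optional padding length bytes: If the padding flag is set, one or more bytes follow that give the number of padding bytes appended at the end of the packet. A value of 0 to 254 is the count itself and ends the field; a value of 255 adds 254 bytes and means another length byte follows. The padding itself sits after the last frame and serves only to inflate the packet size.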
- Length indicators: If VBR is used, the header contains length indicators for the first \(N-1\) frames (where \(N\) is the number of frames). If CBR is used, no length indicators are present, as all frames are assumed to be of equal size.
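
The rules for all four frame count codes can be combined into one small splitter. The following is a sketch under stated assumptions: the function names `read_frame_length` and `split_frames` are hypothetical, the caller is assumed to already know the total packet length, and checks such as the 120 ms duration limit and the 1275-byte frame size limit are left out for brevity.

```python
from typing import List, Tuple


def read_frame_length(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a 1- or 2-byte frame length; return (length, next_pos)."""
    if pos >= len(data):
        raise ValueError("truncated length field")
    first = data[pos]
    if first < 252:                      # one-byte form: 0..251
        return first, pos + 1
    if pos + 1 >= len(data):
        raise ValueError("truncated length field")
    return 4 * data[pos + 1] + first, pos + 2   # two-byte form


def split_frames(packet: bytes) -> List[bytes]:
    """Return the compressed frames of one packet (hypothetical helper)."""
    if not packet:
        raise ValueError("empty packet")
    code = packet[0] & 0x03
    pos, end = 1, len(packet)

    if code == 0:                        # one frame fills the packet
        sizes = [end - pos]
    elif code == 1:                      # two frames of equal size
        if (end - pos) % 2:
            raise ValueError("code 1 payload must have even length")
        sizes = [(end - pos) // 2] * 2
    elif code == 2:                      # first size explicit, second implied
        first, pos = read_frame_length(packet, pos)
        if first > end - pos:
            raise ValueError("first frame exceeds packet")
        sizes = [first, end - pos - first]
    else:                                # code 3: frame count byte follows
        if end < 2:
            raise ValueError("missing frame count byte")
        count_byte, pos = packet[1], 2
        vbr = bool(count_byte & 0x80)
        has_padding = bool(count_byte & 0x40)
        count = count_byte & 0x3F
        if count == 0:
            raise ValueError("frame count must not be zero")
        if has_padding:
            padding = 0
            while True:                  # 255 means "254 more, keep reading"
                if pos >= end:
                    raise ValueError("truncated padding length")
                value = packet[pos]
                pos += 1
                padding += 254 if value == 255 else value
                if value != 255:
                    break
            end -= padding               # padding sits at the packet's tail
            if end < pos:
                raise ValueError("padding exceeds packet")
        if vbr:                          # explicit sizes for the first N-1
            sizes = []
            for _ in range(count - 1):
                size, pos = read_frame_length(packet, pos)
                sizes.append(size)
            last = end - pos - sum(sizes)
            if last < 0:
                raise ValueError("frame sizes exceed packet")
            sizes.append(last)
        else:                            # CBR: equal split of what remains
            if (end - pos) % count:
                raise ValueError("CBR payload not divisible by frame count")
            sizes = [(end - pos) // count] * count

    frames = []
    for size in sizes:
        frames.append(packet[pos:pos + size])
        pos += size
    return frames
```

Returning plain `bytes` slices keeps the sketch independent of any decoder; a real parser would hand each slice to the decoder along with the configuration from the TOC byte.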

3. The Audio Payload

The remainder of the Opus packet is the actual audio payload, containing the compressed audio data.

Depending on the configuration determined by the TOC byte, the payload is processed by one of three internal modes:
- SILK Mode: Optimized for speech preservation, typically operating at lower sample rates and bitrates.
- CELT Mode: Optimized for high-fidelity music and ultra-low latency, operating across the full frequency spectrum.
- Hybrid Mode: Uses SILK for lower frequencies (up to 8 kHz) and CELT for higher frequencies (above 8 kHz) within the same frame to maximize efficiency.
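
How the 5-bit configuration number selects a mode can be illustrated with a small lookup. The sketch below follows the configuration table in the Opus specification (RFC 6716); the function name `describe_config` and the short bandwidth labels are assumptions made for this example.

```python
from typing import Tuple


def describe_config(config: int) -> Tuple[str, str, float]:
    """Map a configuration number to (mode, bandwidth, frame duration in ms)."""
    if not 0 <= config <= 31:
        raise ValueError("configuration number must be 0..31")
    if config < 12:
        # 0..11: SILK-only, four frame durations per bandwidth
        mode = "SILK"
        bandwidth = ("NB", "MB", "WB")[config // 4]
        frame_ms = (10, 20, 40, 60)[config % 4]
    elif config < 16:
        # 12..15: Hybrid, two frame durations per bandwidth
        mode = "Hybrid"
        bandwidth = ("SWB", "FB")[(config - 12) // 2]
        frame_ms = (10, 20)[config % 2]
    else:
        # 16..31: CELT-only, four frame durations per bandwidth
        mode = "CELT"
        bandwidth = ("NB", "WB", "SWB", "FB")[(config - 16) // 4]
        frame_ms = (2.5, 5, 10, 20)[config % 4]
    return mode, bandwidth, float(frame_ms)
```

Here NB, MB, WB, SWB, and FB stand for narrowband, mediumband, wideband, super-wideband, and fullband.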

The decoder reads the frames sequentially, using the structural boundaries defined by the TOC and length bytes to parse and reconstruct the audio signal.