MPEG-4 File Structure: Atoms and Boxes Explained

MP4 (MPEG-4 Part 14) is a widely used multimedia container format, most often used to store video, audio, and subtitles. This article breaks down how the format structures its data using hierarchical building blocks known as boxes (or atoms), explaining their roles, their types, and how they organize multimedia data for playback.

The Concept of Boxes (Atoms)

At the core of the MP4 format is an object-oriented structure inherited from the ISO base media file format (ISO/IEC 14496-12), on which the MP4 specification (ISO/IEC 14496-14) is built. Every piece of data in an MP4 file is encapsulated within a container called a box (historically referred to as an atom in Apple’s QuickTime format).

An MP4 file is essentially a continuous sequence of these boxes. Boxes can contain actual data, or they can act as containers that hold other sub-boxes, creating a hierarchical, tree-like nested structure.

The Anatomy of a Box

Every box starts with a standard header that allows parsers to navigate the file without needing to decode the actual payload. A basic box header consists of two main fields:

  1. Size (4 bytes): A 32-bit big-endian unsigned integer indicating the total size of the box in bytes, including the header. If the size is set to 1, a 64-bit large size field follows the type field, which allows a single box to exceed 4 gigabytes. If the size is set to 0, the box extends to the end of the file.
  2. Type (4 bytes): A four-character code (FourCC) composed of ASCII characters that identifies the box’s function (e.g., ftyp, moov, mdat).

Following the header is the payload, which contains either specific data fields or nested sub-boxes.
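
The header layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete parser: the function name read_box_header is hypothetical, the input is assumed to be an in-memory bytes object, and the sketch also handles a size of 0, which the ISO base media file format uses to mean that the box runs to the end of the file.

```python
import struct


def read_box_header(buf, offset=0):
    """Return (box_type, box_size, header_size) for the box at `offset`.

    Hypothetical helper for illustration; `buf` is assumed to hold the
    whole file (or at least the whole box) in memory.
    """
    if offset + 8 > len(buf):
        raise ValueError("truncated box header")

    # Size and type are both 4 bytes; integers are big-endian.
    size, raw_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8

    if size == 1:
        # A 64-bit size field follows the type field.
        if offset + 16 > len(buf):
            raise ValueError("truncated 64-bit size field")
        (size,) = struct.unpack_from(">Q", buf, offset + 8)
        header_size = 16
    elif size == 0:
        # The box extends to the end of the data.
        size = len(buf) - offset

    if size < header_size:
        raise ValueError("box size is smaller than its own header")

    # latin-1 decodes any byte value, so unusual type codes do not raise.
    return raw_type.decode("latin-1"), size, header_size


# Usage on a synthetic 16-byte "free" box (8-byte header + 8-byte payload).
example = struct.pack(">I4s", 16, b"free") + b"\x00" * 8
print(read_box_header(example))
```

Because the size includes the header, a parser can skip any box it does not understand by advancing offset by the returned size.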

Key Boxes in an MP4 File

While there are dozens of different box types defined in the MPEG-4 standard, a typical, playable MP4 file relies on a few critical root-level boxes:

1. ftyp (File Type Box)

This is normally the first box in the file; the specification requires it to appear as early as possible. It declares the file’s major brand, a minor version, and a list of compatible brands. Each brand names a specification the file conforms to, which tells the media player whether it can read the file.
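
As a rough sketch of what a reader does with this box, the payload of ftyp (the bytes after the 8-byte header) can be decoded as a 4-byte major brand, a 32-bit minor version, and then a run of 4-byte compatible brands. The function name parse_ftyp_payload and the sample brand values below are illustrative assumptions.

```python
import struct


def parse_ftyp_payload(payload):
    """Decode the payload of an ftyp box (header already stripped).

    Hypothetical helper for illustration only.
    """
    if len(payload) < 8:
        raise ValueError("ftyp payload too short")

    major_brand, minor_version = struct.unpack_from(">4sI", payload, 0)

    # Everything after the first 8 bytes is a list of 4-byte brand codes.
    compatible = [
        payload[i:i + 4].decode("latin-1")
        for i in range(8, len(payload) - 3, 4)
    ]
    return {
        "major_brand": major_brand.decode("latin-1"),
        "minor_version": minor_version,
        "compatible_brands": compatible,
    }


# Usage with a synthetic payload (brand values chosen for the example).
sample = b"isom" + struct.pack(">I", 512) + b"isomiso2mp41"
print(parse_ftyp_payload(sample))
```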

2. moov (Movie Box)

The moov box is the metadata container for the entire file. It does not contain the actual audio or video samples, but rather the instructions on how to play them. Because it contains the timeline, codecs, and location offsets of the media samples, a player cannot start playback until it parses the moov box. It contains nested sub-boxes such as:

  * mvhd (Movie Header): Contains overall file information like creation time, duration, and time scale.
  * trak (Track Box): Represents a single track of media (e.g., one for video, one for audio).
  * stbl (Sample Table Box): Located deep inside the track box, this index maps media sample times to physical byte offsets within the file.
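
The nesting of moov and its sub-boxes can be made visible by walking the box tree recursively. The sketch below assumes the file is held in memory as bytes; the names iter_boxes, box_tree, and make_box are hypothetical, and CONTAINER_TYPES is only a partial, assumed list of box types whose payload consists of further boxes.

```python
import struct

# Assumption: a partial set of box types whose payload is a list of
# child boxes. A real parser would need a more complete table.
CONTAINER_TYPES = {"moov", "trak", "mdia", "minf", "stbl", "edts", "dinf"}


def iter_boxes(buf, start, end):
    """Yield (type, payload_start, box_end) for each box in buf[start:end]."""
    offset = start
    while offset + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", buf, offset)
        header = 8
        if size == 1:  # 64-bit size follows the type field
            if offset + 16 > end:
                raise ValueError("truncated 64-bit size field")
            (size,) = struct.unpack_from(">Q", buf, offset + 8)
            header = 16
        elif size == 0:  # box runs to the end of the enclosing range
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError(f"bad box size at offset {offset}")
        yield raw_type.decode("latin-1"), offset + header, offset + size
        offset += size


def box_tree(buf, start=0, end=None, depth=0):
    """Return the box hierarchy as a list of indented type names."""
    if end is None:
        end = len(buf)
    lines = []
    for box_type, payload_start, box_end in iter_boxes(buf, start, end):
        lines.append("  " * depth + box_type)
        if box_type in CONTAINER_TYPES:
            lines.extend(box_tree(buf, payload_start, box_end, depth + 1))
    return lines


def make_box(box_type, payload=b""):
    """Build a synthetic box for the demo (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Usage: a tiny synthetic file with the nesting described above.
stbl = make_box(b"stbl")
trak = make_box(b"trak", make_box(b"mdia", make_box(b"minf", stbl)))
moov = make_box(b"moov", make_box(b"mvhd", b"\x00" * 100) + trak)
data = make_box(b"ftyp", b"isom\x00\x00\x00\x00") + moov + make_box(b"mdat", b"xx")
print("\n".join(box_tree(data)))
```

Only the types listed in CONTAINER_TYPES are descended into; boxes such as mvhd hold fixed data fields instead of child boxes, so the walker treats them as leaves.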

3. mdat (Media Data Box)

The mdat box contains the actual compressed media payloads: the video frames, audio packets, and subtitle text. Since it holds the bulk of the file’s data, it is usually the largest box. The mdat box has no internal index of its own; players rely entirely on the structural map inside the moov box to locate and decode individual frames within it.

How Players Read the Structure

To play an MP4 file, a media player performs the following sequence:

  1. It reads the ftyp box to confirm compatibility.
  2. It locates and parses the moov box to build an index of the video and audio tracks, finding where each frame starts and how long it lasts.
  3. It uses the byte offsets from the moov box to jump to specific locations inside the mdat box, extracting and decoding the raw media packets for seamless real-time playback.
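
The last step, turning offsets from the moov box into reads inside mdat, can be illustrated with two of the tables stored in the sample table: stco (chunk offsets) and stsz (sample sizes). The sketch below is deliberately simplified and rests on loud assumptions: every chunk is taken to hold exactly one sample (real files also need the stsc table to map samples to chunks), offsets are 32-bit stco entries rather than the 64-bit co64 variant, and the function names are hypothetical.

```python
import struct


def parse_stco(payload):
    """Return the chunk offsets from an stco payload (header stripped).

    Layout: 1 byte version + 3 bytes flags, a 32-bit entry count,
    then that many 32-bit file offsets.
    """
    (entry_count,) = struct.unpack_from(">I", payload, 4)
    return list(struct.unpack_from(f">{entry_count}I", payload, 8))


def parse_stsz(payload):
    """Return one size per sample from an stsz payload (header stripped).

    Layout: 1 byte version + 3 bytes flags, a 32-bit default sample size,
    a 32-bit sample count, then per-sample sizes if the default is 0.
    """
    sample_size, sample_count = struct.unpack_from(">II", payload, 4)
    if sample_size != 0:
        return [sample_size] * sample_count
    return list(struct.unpack_from(f">{sample_count}I", payload, 12))


def extract_samples(file_bytes, stco_payload, stsz_payload):
    """Slice samples out of the file.

    ASSUMPTION: one sample per chunk, so offsets and sizes pair up 1:1.
    This does not hold for most real files.
    """
    offsets = parse_stco(stco_payload)
    sizes = parse_stsz(stsz_payload)
    if len(offsets) != len(sizes):
        raise ValueError("one-sample-per-chunk assumption does not hold")
    return [file_bytes[o:o + s] for o, s in zip(offsets, sizes)]


# Usage with synthetic data: two "samples" at file offsets 8 and 12.
file_bytes = b"HEADERxx" + b"AAAA" + b"BB"
stco = b"\x00" * 4 + struct.pack(">I2I", 2, 8, 12)
stsz = b"\x00" * 4 + struct.pack(">II2I", 0, 2, 4, 2)
print(extract_samples(file_bytes, stco, stsz))
```

The offsets in stco are measured from the start of the file, not from the start of mdat, which is why the sketch slices file_bytes directly.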