MPEG-4 Metadata, Chapters, and Subtitles Explained

The MPEG-4 (MP4) file format, standardized as ISO/IEC 14496-14 and built on the ISO base media file format (ISO/IEC 14496-12), is a digital multimedia container widely used for storing video, audio, subtitles, and metadata. Rather than writing this data as one flat stream, MP4 uses a hierarchy of nested data blocks called “atoms” or “boxes.” This article explains how the MP4 container organizes, stores, and synchronizes metadata, chapter markers, and subtitles within these boxes.

The Foundation: The Atom Structure

To understand how MP4 stores auxiliary data, one must first understand its basic structure. An MP4 file is a sequence of “atoms” (or boxes), each consisting of a header and a payload. The header holds a 32-bit big-endian size, which counts the header as well as the payload, followed by a four-character type identifier, or “fourcc.” A size of 1 signals that a 64-bit size follows the type, and a size of 0 means the box runs to the end of the file. The primary atom for media layout is the Movie Atom (moov), which acts as an index for the entire file. The raw audio and video data reside in the Media Data Atom (mdat), while the metadata, chapters, and subtitle configurations are defined within the moov atom and its sub-boxes.
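The header layout described above can be illustrated with a minimal Python sketch that walks the boxes in a byte buffer. The names iter_boxes and make_box are hypothetical helpers introduced only for this illustration, and the sample buffer is synthetic rather than taken from a real file.

```python
import struct
from typing import Iterator, Optional, Tuple

def iter_boxes(buf: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (fourcc, payload_start, payload_end) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:                      # 64-bit size follows the type
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:                    # box runs to the end of the container
            size = end - pos
        if size < header or pos + size > end:
            break                          # malformed box; stop rather than guess
        # latin-1 keeps bytes such as 0xA9 (the (c) sign in iTunes tags) readable
        yield raw_type.decode("latin-1"), pos + header, pos + size
        pos += size

def make_box(fourcc: bytes, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size header (illustration only)."""
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload

# Synthetic file: ftyp, a moov holding an empty udta, and a tiny mdat.
sample = (make_box(b"ftyp", b"isom\x00\x00\x02\x00")
          + make_box(b"moov", make_box(b"udta"))
          + make_box(b"mdat", b"\x00" * 4))

for fourcc, payload_start, payload_end in iter_boxes(sample):
    print(fourcc, payload_end - payload_start)
```

Calling iter_boxes again on a container box's payload range (for example, the span reported for moov) descends one level into the hierarchy.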

How MP4 Stores Metadata

Metadata in an MP4 file (such as the title, artist, release date, and album artwork) is stored within the moov atom, which holds the file’s structural and descriptive information. Two nested layers are involved:

  1. The udta (User Data) and meta Boxes: Inside the moov atom, a User Data Box (udta) contains a Metadata Box (meta). This meta box contains a list of keys and values defining the file’s properties.
  2. iTunes Metadata (ilst atom): The most common convention for MP4 metadata is the iTunes-style format. Located inside the meta box, the Item List Box (ilst) acts as a registry of tags. Each tag is a box named by a fourcc (e.g., ©nam for track name, ©ART for artist, and covr for JPEG or PNG cover art) that wraps a data box holding a type code and the value.
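As a rough sketch of the lookup described in the list above, the following Python code follows the moov, udta, meta, ilst path and reads text tags. It rests on several assumptions: 32-bit box sizes only, a meta box that starts with four version/flags bytes (true for MP4-style files, not for every QuickTime file), and a data box whose payload begins with a 4-byte type code and a 4-byte locale. The helper names are hypothetical and the input is synthetic.

```python
import struct

def boxes(buf, start, end):
    """Yield (fourcc_bytes, payload_start, payload_end); 32-bit sizes only."""
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack_from(">I4s", buf, pos)
        if size < 8 or pos + size > end:
            break
        yield kind, pos + 8, pos + size
        pos += size

def find(buf, start, end, wanted):
    """Return the payload span of the first child box of the given type."""
    for kind, s, e in boxes(buf, start, end):
        if kind == wanted:
            return s, e
    return None

def read_itunes_tags(buf):
    span = (0, len(buf))
    for kind in (b"moov", b"udta", b"meta"):
        span = find(buf, span[0], span[1], kind)
        if span is None:
            return {}
    # Assumption: meta is a "full box", so skip 4 bytes of version/flags.
    span = find(buf, span[0] + 4, span[1], b"ilst")
    if span is None:
        return {}
    tags = {}
    for key, s, e in boxes(buf, span[0], span[1]):
        data = find(buf, s, e, b"data")
        if data is None or data[1] - data[0] < 8:
            continue
        type_code, _locale = struct.unpack_from(">II", buf, data[0])
        value = buf[data[0] + 8:data[1]]
        name = key.decode("latin-1")       # 0xA9 becomes the (c) sign
        # Type 1 marks UTF-8 text; other types (e.g. cover art) stay as bytes.
        tags[name] = value.decode("utf-8") if type_code & 0xFFFFFF == 1 else value
    return tags

def box(kind, payload=b""):
    return struct.pack(">I4s", 8 + len(payload), kind) + payload

# Synthetic moov/udta/meta/ilst tree with a single title tag.
title = box(b"data", struct.pack(">II", 1, 0) + "Demo Title".encode("utf-8"))
ilst = box(b"ilst", box(b"\xa9nam", title))
demo = box(b"moov", box(b"udta", box(b"meta", b"\x00\x00\x00\x00" + ilst)))

print(read_itunes_tags(demo))
```

Real files usually place a handler box (hdlr) before ilst inside meta; because the sketch searches children by type, the extra box does not affect the lookup.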

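Because this metadata lives in the moov atom rather than in the media data, a player can display the file details by parsing the index alone, without scanning the video stream. This is fastest when moov is placed at the start of the file (often called “fast start”); if moov sits after mdat, the player must first seek to, or download, the end of the file.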
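How quickly a player reaches the moov atom depends on where it sits relative to mdat. The sketch below, using hypothetical helper names and a synthetic buffer, lists the top-level box order and checks whether moov comes first. A real tool would more likely seek through the file than load it into memory.

```python
import struct

def top_level_order(buf: bytes) -> list:
    """Return the fourcc of each top-level box, in file order."""
    order, pos = [], 0
    while pos + 8 <= len(buf):
        size, kind = struct.unpack_from(">I4s", buf, pos)
        if size == 1 and pos + 16 <= len(buf):   # 64-bit size follows the type
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
        elif size == 0:                          # box runs to end of file
            size = len(buf) - pos
        if size < 8:
            break                                # malformed; stop
        order.append(kind.decode("latin-1"))
        pos += size
    return order

def moov_before_mdat(buf: bytes) -> bool:
    """True when the index (moov) precedes the media data (mdat)."""
    order = top_level_order(buf)
    return ("moov" in order and "mdat" in order
            and order.index("moov") < order.index("mdat"))

def box(kind: bytes, payload: bytes = b"") -> bytes:
    return struct.pack(">I4s", 8 + len(payload), kind) + payload

fast_start = box(b"ftyp", b"isom") + box(b"moov") + box(b"mdat", b"\x00" * 16)
slow_start = box(b"ftyp", b"isom") + box(b"mdat", b"\x00" * 16) + box(b"moov")

print(moov_before_mdat(fast_start), moov_before_mdat(slow_start))
```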

How MP4 Stores Chapter Markers

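Chapter markers allow viewers to jump to specific points in a video. In the MP4 format, chapters are not stored as simple timestamps in a text file; instead, they are commonly stored as a dedicated text track that is not displayed as subtitles. Each sample in that track carries one chapter title, and the sample’s start time marks where the chapter begins.

The link between the video and its chapters is a Track Reference Box (tref) inside the main track. It contains a reference of type chap that lists the track ID of the chapter track. When a player loads the video, it reads the tref box, finds the associated chapter track, and maps the timed text samples to the player’s seek bar.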
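The lookup a player performs can be sketched in two steps, shown here in Python with hypothetical function names and synthetic inputs. The first step assumes the tref payload holds a chap reference box listing 32-bit chapter track IDs. The second assumes chapter start times follow from the chapter track's sample durations (the (sample_count, sample_delta) pairs of its stts box, in units of the track timescale), with no edit list shifting the timeline.

```python
import struct

def chapter_track_ids(tref_payload: bytes) -> list:
    """Collect the track IDs listed in any 'chap' reference inside tref."""
    ids, pos = [], 0
    while pos + 8 <= len(tref_payload):
        size, kind = struct.unpack_from(">I4s", tref_payload, pos)
        if size < 8 or pos + size > len(tref_payload):
            break
        if kind == b"chap":
            count = (size - 8) // 4          # each track ID is 32 bits
            ids.extend(struct.unpack_from(f">{count}I", tref_payload, pos + 8))
        pos += size
    return ids

def chapter_starts(stts_entries, timescale, titles):
    """Pair each chapter title with its start time in seconds."""
    starts, ticks = [], 0
    for sample_count, sample_delta in stts_entries:
        for _ in range(sample_count):
            starts.append(ticks / timescale)
            ticks += sample_delta            # next chapter begins where this ends
    return list(zip(starts, titles))

# Synthetic tref payload: one 'chap' reference pointing at track 3.
tref_payload = struct.pack(">I4sI", 12, b"chap", 3)
print(chapter_track_ids(tref_payload))

# Two 60-second chapters followed by one 45-second chapter, timescale 1000.
print(chapter_starts([(2, 60000), (1, 45000)], 1000,
                     ["Intro", "Main", "Credits"]))
```

The titles passed to chapter_starts would come from decoding the chapter track's text samples, which is left out here to keep the sketch short.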

How MP4 Stores Subtitles

Subtitles in MP4 files can be stored in several ways, which fall into two groups: separate timed text tracks inside the container, and caption data embedded in the video bitstream.

  1. MPEG-4 Timed Text (3GPP / tx3g): This is the native text subtitle format for MP4, specified in ISO/IEC 14496-17 and derived from 3GPP Timed Text. Subtitles are stored in a dedicated track (a trak whose handler type is typically sbtl in Apple-style files or text in 3GPP-style files), and the sample description box uses the tx3g format code. Each sample holds a length-prefixed text string. Default formatting (font, color, size, and text box position) lives in the tx3g sample description, and per-sample overrides follow the text as style records (styl) inside the sample itself. Because the player renders the text itself rather than displaying pre-rendered bitmaps, the subtitles scale cleanly with the player’s window.
  2. Closed Captions (EIA-608 / EIA-708): Rather than using a separate track, traditional closed captions are often embedded directly in the video stream. They are carried in the Supplemental Enhancement Information (SEI) NAL units of H.264 or H.265 video bitstreams, and the player decodes them frame by frame alongside the video.
  3. Advanced Sub Station Alpha (ASS/SSA) and SRT: While widely popular, formats like SRT and ASS are not defined in the MP4 specification. When muxed into an MP4 container, they are typically converted to the native tx3g timed text format, which generally cannot preserve advanced ASS styling. Some tools instead write the original data into a non-standard track that only certain players recognize, so the result depends on the software used to create the file.
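To make the tx3g sample layout more concrete, here is a minimal Python sketch that decodes one subtitle sample. It assumes the commonly documented layout: a 16-bit big-endian text length, the text itself (UTF-8 unless it starts with a UTF-16 byte order mark), and then optional modifier boxes such as styl. The function name and the sample bytes are hypothetical.

```python
import struct

def decode_tx3g_sample(sample: bytes):
    """Return (text, [modifier fourccs]) for one tx3g subtitle sample."""
    if len(sample) < 2:
        return "", []
    (text_len,) = struct.unpack_from(">H", sample, 0)
    raw = sample[2:2 + text_len]
    if raw[:2] in (b"\xfe\xff", b"\xff\xfe"):    # UTF-16 byte order mark
        text = raw.decode("utf-16")
    else:
        text = raw.decode("utf-8", errors="replace")
    # Anything after the text is a run of ordinary boxes (size + fourcc).
    modifiers, pos = [], 2 + text_len
    while pos + 8 <= len(sample):
        size, kind = struct.unpack_from(">I4s", sample, pos)
        if size < 8 or pos + size > len(sample):
            break
        modifiers.append(kind.decode("latin-1"))
        pos += size
    return text, modifiers

# Synthetic sample: "Hello" followed by a styl box with zero style entries.
styl = struct.pack(">I4sH", 10, b"styl", 0)
sample = struct.pack(">H", 5) + b"Hello" + styl

print(decode_tx3g_sample(sample))
print(decode_tx3g_sample(struct.pack(">H", 0)))  # empty text: nothing on screen
```

A sample with a zero-length text, as in the second call, is how a gap between subtitles is usually represented, since every stretch of the track timeline needs a sample.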