MPEG-4 Metadata Chapters and Subtitles Explained
The MPEG-4 (MP4) file format, standardized under ISO/IEC 14496-14, is a digital multimedia container format widely used for storing video, audio, subtitles, and metadata. Rather than saving this information in a flat file, MP4 relies on a hierarchical structure of nested data blocks called “atoms” or “boxes.” This article explains how the MP4 container organizes, stores, and synchronizes metadata, chapter markers, and subtitles within these specialized boxes.
The Foundation: The Atom Structure
To understand how MP4 stores auxiliary data, one must understand its
foundational structure. An MP4 file is composed of “atoms” (or boxes),
each containing a header (specifying size and a four-character type
identifier, or “fourcc”) and a payload. The primary atom for media
layout is the Movie Atom (moov), which acts as an index for
the entire file. While the actual raw audio and video data reside in the
Media Data Atom (mdat), the metadata, chapters, and
subtitle configurations are defined within the moov atom
and its sub-boxes.
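The header layout described above can be walked with a few lines of Python. This is a minimal sketch rather than a full parser: the helper names `iter_boxes` and `make_box` are made up for illustration, and the sample bytes are synthetic rather than taken from a real file. It handles the two special size values, 1 (a 64-bit size follows the type) and 0 (the box runs to the end of the enclosing range).

```python
import struct

def iter_boxes(data, start=0, end=None):
    """Yield (fourcc, payload_start, payload_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, fourcc = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            if pos + 16 > end:
                break
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:
            # A size of 0 means the box extends to the end of the range.
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated box: stop rather than guess
        yield fourcc.decode("latin-1"), pos + header, pos + size
        pos += size

def make_box(fourcc, payload=b""):
    """Build a box with a 32-bit size header (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload

# Synthetic file: ftyp, then moov (holding one child), then mdat.
sample = (
    make_box(b"ftyp", b"isom\x00\x00\x02\x00")
    + make_box(b"moov", make_box(b"mvhd", b"\x00" * 4))
    + make_box(b"mdat", b"\x01\x02\x03")
)
top_level = [name for name, _, _ in iter_boxes(sample)]
```

Because child boxes use the same header layout, calling `iter_boxes` again on the payload range of `moov` lists its sub-boxes.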
How MP4 Stores Metadata
Metadata in an MP4 file, such as the title, artist, release date, and
album artwork, is stored within the moov atom, which is
designed to hold structural and descriptive information. Two nested
structures are involved:
- The udta (User Data) and meta Boxes: Inside the moov atom, a User Data Box (udta) contains a Metadata Box (meta). This meta box contains a list of keys and values defining the file’s properties.
- iTunes Metadata (ilst atom): The most common standard for MP4 metadata is the iTunes-style metadata format. Located inside the meta box, the Item List Box (ilst) acts as a registry. It uses specific fourcc codes to store tags (e.g., ©nam for track name, ©ART for artist, and covr for JPEG or PNG cover art).
Because this metadata is stored in the moov atom rather than
in the media data, media players can parse and display the file details
without scanning the entire video stream.
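As a rough illustration of the moov, udta, meta, ilst nesting, the sketch below reads text tags from a synthetic buffer. The helper names (`box`, `children`, `find`, `read_text_tags`) are hypothetical. It assumes the iTunes-style layout in which meta is a full box (4 bytes of version and flags before its children) and each tag wraps a data box whose payload starts with a 4-byte type code and a 4-byte locale; it ignores 64-bit box sizes and non-text values such as covr.

```python
import struct

def box(fourcc, payload=b""):
    """Build a box with a 32-bit size header (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload

def children(data, start, end):
    """Yield (fourcc, payload_start, payload_end) for boxes in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, fourcc = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > end:
            break  # malformed or unsupported size: stop
        yield fourcc.decode("latin-1"), pos + 8, pos + size
        pos += size

def find(data, path):
    """Follow a list of fourccs downward; return the final payload range or None."""
    start, end = 0, len(data)
    for name in path:
        for fourcc, s, e in children(data, start, end):
            if fourcc == name:
                start, end = s, e
                if name == "meta":
                    start += 4  # assumption: meta is a full box (version + flags)
                break
        else:
            return None
    return start, end

def read_text_tags(data):
    """Return {tag fourcc: string} for ilst items whose data box holds UTF-8 text."""
    tags = {}
    span = find(data, ["moov", "udta", "meta", "ilst"])
    if span is None:
        return tags
    for tag, s, e in children(data, *span):
        for sub, ds, de in children(data, s, e):
            if sub == "data" and de - ds >= 8:
                type_code = struct.unpack_from(">I", data, ds)[0] & 0xFFFFFF
                if type_code == 1:  # type 1 marks UTF-8 text
                    tags[tag] = data[ds + 8:de].decode("utf-8")
    return tags

def text_item(fourcc, value):
    """Build one ilst entry holding a UTF-8 string (hypothetical test helper)."""
    return box(fourcc, box(b"data", struct.pack(">II", 1, 0) + value.encode("utf-8")))

ilst = box(b"ilst", text_item(b"\xa9nam", "My Title") + text_item(b"\xa9ART", "Some Artist"))
meta = box(b"meta", b"\x00\x00\x00\x00" + box(b"hdlr", b"\x00" * 24) + ilst)
movie = box(b"moov", box(b"udta", meta))
tags = read_text_tags(movie)
```

The © character in tag names is the single byte 0xA9, which is why the sketch decodes fourccs as Latin-1 rather than UTF-8.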
How MP4 Stores Chapter Markers
Chapter markers allow viewers to jump to specific points in a video. In the MP4 format, chapters are not stored as simple timestamps in a text file; instead, they are treated as a specialized, silent text track.
- The Chapter Track: A dedicated track (trak) is created within the moov atom. The handler reference box (hdlr) inside this track identifies it as a text or chapter track.
- The Text Samples: The text track contains a series of timed text samples. Each sample contains the text string for a chapter title (e.g., “Introduction” or “Scene 2”).
- Track Linkage (tref): To associate these text samples with the actual video timeline, the main video track contains a Track Reference Box (tref) with a reference type of chap (chapters). This chap box points directly to the track ID of the text chapter track.
When a player loads the video, it reads the tref box,
finds the associated chapter track, and maps the timed text samples to
the player’s seek bar.
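The tref lookup a player performs can be sketched as follows. This is an illustrative outline on synthetic bytes, not a complete demuxer: `box`, `children`, `tkhd`, and `chapter_links` are hypothetical names, 64-bit box sizes are ignored, and the track ID is read from the Track Header Box (tkhd) at the offset that applies to its version (after 4 bytes of version and flags plus two timestamps, which are 32-bit in version 0 and 64-bit in version 1).

```python
import struct

def box(fourcc, payload=b""):
    """Build a box with a 32-bit size header (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload

def children(data, start, end):
    """Yield (fourcc, payload_start, payload_end) for boxes in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, fourcc = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > end:
            break  # malformed or unsupported size: stop
        yield fourcc.decode("latin-1"), pos + 8, pos + size
        pos += size

def chapter_links(data):
    """Map each track ID to the chapter track IDs named in its tref/chap box."""
    links = {}
    for name, ms, me in children(data, 0, len(data)):
        if name != "moov":
            continue
        for tname, ts, te in children(data, ms, me):
            if tname != "trak":
                continue
            track_id, chap_ids = None, []
            for bname, bs, be in children(data, ts, te):
                if bname == "tkhd":
                    # version 0: two 32-bit timestamps; version 1: two 64-bit
                    offset = bs + (20 if data[bs] == 1 else 12)
                    track_id = struct.unpack_from(">I", data, offset)[0]
                elif bname == "tref":
                    for rname, rs, re_ in children(data, bs, be):
                        if rname == "chap":
                            count = (re_ - rs) // 4  # payload is a list of 32-bit IDs
                            chap_ids = list(struct.unpack_from(f">{count}I", data, rs))
            if track_id is not None and chap_ids:
                links[track_id] = chap_ids
    return links

def tkhd(track_id):
    """Minimal version-0 track header carrying only the track ID (test helper)."""
    return box(b"tkhd", struct.pack(">IIII", 0, 0, 0, track_id) + b"\x00" * 4)

video = box(b"trak", tkhd(1) + box(b"tref", box(b"chap", struct.pack(">I", 2))))
chapters = box(b"trak", tkhd(2))
movie = box(b"moov", video + chapters)
links = chapter_links(movie)
```

Here `links` maps the video track (ID 1) to the chapter text track (ID 2); a real player would then read that track's timed samples to label the seek bar.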
How MP4 Stores Subtitles
Subtitles in MP4 files can be stored in several formats, which fall into two groups: dedicated timed text tracks inside the container, and captions embedded in the video bitstream.
- MPEG-4 Timed Text (3GPP / tx3g): This is the native subtitle format for MP4. Subtitles are stored in a dedicated subtitle track (a trak with a handler type of sbtl, or text in 3GPP-style files). The sample description box uses the tx3g format code. Each sample holds the subtitle text, prefixed by a 16-bit length. Default formatting (such as font, color, size, and positioning) is declared in the tx3g sample description, and per-sample overrides follow the text as style modifier boxes such as styl. Because the player renders the text itself, the subtitles scale with the player’s window.
- Closed Captions (EIA-608 / EIA-708): Instead of utilizing a separate track, traditional closed captions are often embedded directly into the video stream itself. They are stored within the Supplemental Enhancement Information (SEI) NAL units of H.264 or H.265 video bitstreams. The player decodes these captions frame-by-frame alongside the video.
- Advanced Sub Station Alpha (ASS/SSA) and SRT: While widely popular, formats like SRT and ASS are not natively defined in the original MP4 specification. When muxed into an MP4 container, they are typically converted into the native tx3g timed text format, or encapsulated into private streams within the container, depending on the software used to create the file.
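To make the tx3g sample layout concrete, here is a small sketch that decodes one timed text sample: a 16-bit big-endian byte length, the text itself, and then any trailing modifier boxes such as styl. The function name `decode_tx3g_sample` and the sample bytes are illustrative assumptions, and the sketch only lists the modifier box types instead of interpreting their style records.

```python
import struct

def decode_tx3g_sample(sample):
    """Return (text, modifier_fourccs) for one tx3g timed text sample."""
    if len(sample) < 2:
        return "", []
    (text_len,) = struct.unpack_from(">H", sample, 0)
    raw = sample[2:2 + text_len]
    # A byte order mark signals UTF-16; otherwise the text is UTF-8.
    if raw[:2] in (b"\xfe\xff", b"\xff\xfe"):
        text = raw.decode("utf-16")
    else:
        text = raw.decode("utf-8", errors="replace")
    modifiers = []
    pos = 2 + text_len
    while pos + 8 <= len(sample):
        size, fourcc = struct.unpack_from(">I4s", sample, pos)
        if size < 8 or pos + size > len(sample):
            break  # malformed modifier box: stop
        modifiers.append(fourcc.decode("latin-1"))
        pos += size
    return text, modifiers

# Synthetic sample: "Hello" followed by a styl box with one 12-byte style record
# (start char, end char, font ID, face flags, font size, RGBA color).
style_record = struct.pack(">HHHBB4B", 0, 5, 1, 1, 18, 255, 255, 255, 255)
styl = struct.pack(">I4sH", 8 + 2 + len(style_record), b"styl", 1) + style_record
sample = struct.pack(">H", 5) + b"Hello" + styl
text, modifiers = decode_tx3g_sample(sample)
```

An empty sample (a length of zero with no text) is commonly used to clear the previous subtitle from the screen, and the function returns an empty string for it.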