How Fragmented MP4 Works for Live Streaming
Fragmented MP4 (fMP4) is a variant of the MP4 container that enables efficient live streaming over the internet by dividing the media into small, sequential segments, each carrying its own metadata. Unlike traditional MP4 files, which require the metadata for the entire video to be loaded before playback can begin, fMP4 allows players to decode and play video chunks in real time as they are being generated. This article explains the underlying structure of MP4 fragmentation, how the container format is modified to support live delivery, and why it is essential for modern streaming protocols like HLS and MPEG-DASH.
The Problem with Traditional MP4 for Live Streaming
In a standard MP4 container, the metadata describing the timing,
size, and location of every video and audio frame is stored in a single
structure called the Movie Box (moov). The actual media
data (the audio and video samples) is stored in the Media Data Box
(mdat).
For a player to start decoding a traditional MP4, it must first read
the moov box. If the moov box is placed at the
end of the file (the default for many encoders), the player must
download the entire file before playback starts. Even if the
moov box is moved to the beginning of the file (a process
called fast-start), the metadata must still contain information for the
entire video. During a live stream, the video has no defined end,
meaning the encoder cannot write a final moov box. This
makes traditional MP4 entirely unsuitable for live broadcasting.
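The box ordering described above can be checked with a short script. The following is a minimal sketch, not a full MP4 parser: it walks only the top-level box headers (a 4-byte big-endian size followed by a 4-byte type) and reports whether moov appears before mdat. The helper names (iter_boxes, make_box, moov_before_mdat) and the placeholder payloads are assumptions made for illustration.

```python
import struct


def iter_boxes(data: bytes):
    """Yield (type, offset, size) for each top-level box in the byte string."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box extends to the end of the data.
            size = len(data) - offset
        if size < header:
            raise ValueError(f"invalid box size at offset {offset}")
        yield box_type.decode("latin-1"), offset, size
        offset += size


def make_box(box_type: str, payload: bytes = b"") -> bytes:
    """Build a synthetic box with a 32-bit size (illustration only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


def moov_before_mdat(data: bytes) -> bool:
    """Return True if the moov box precedes the mdat box (fast-start layout)."""
    order = [box_type for box_type, _, _ in iter_boxes(data)]
    return order.index("moov") < order.index("mdat")


# Placeholder payloads: these are not decodable media, only box shells.
fast_start = make_box("ftyp", b"isom") + make_box("moov", b"\x00" * 16) + make_box("mdat", b"\x00" * 64)
moov_last = make_box("ftyp", b"isom") + make_box("mdat", b"\x00" * 64) + make_box("moov", b"\x00" * 16)

print(moov_before_mdat(fast_start))
print(moov_before_mdat(moov_last))
```

In both layouts the moov box can only be written once every sample is known, which is the constraint a live stream cannot satisfy.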
How Fragmentation Solves the Problem
Fragmentation reorganizes the MPEG-4 container so that the metadata
and media samples are interleaved throughout the stream in small,
manageable pieces. Instead of one massive moov box and one
massive mdat box, a fragmented MP4 (fMP4) file consists of
an initialization segment followed by a continuous sequence of movie
fragments.
The Structure of a Fragmented MP4
An fMP4 stream is divided into two primary component types:
- Initialization Segment: This segment contains the File Type Box (ftyp) and a minimal Movie Box (moov). In an fMP4, this moov box carries no per-sample timing or location data. Instead, it only defines the track configurations, codecs, and basic properties of the stream. The player downloads this small initialization segment once at the beginning of the session so it knows how to configure the decoders.
- Media Segments (Fragments): Following the initialization segment, the stream is delivered as a series of independent fragments. Each fragment consists of two boxes:
  - Movie Fragment Box (moof): This acts as a mini-metadata box. It contains the timing, duration, and byte offsets for only the video and audio frames contained within that specific fragment.
  - Media Data Box (mdat): This box immediately follows the moof box and contains the actual encoded video and audio samples for that specific duration (typically 2 to 6 seconds of video).
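The layout described in the list above can be sketched by splitting a byte stream into its initialization segment and its moof/mdat pairs. This is an illustrative sketch under simplifying assumptions: only 32-bit box sizes are handled, any other top-level boxes that real segments may carry are ignored, and the stream is synthetic. The function names are hypothetical.

```python
import struct


def make_box(box_type: str, payload: bytes = b"") -> bytes:
    """Build a synthetic box: 32-bit size, 4-character type, then payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


def top_level_boxes(data: bytes):
    """Return a list of (type, raw_bytes) for each top-level box.

    Assumption: every box uses a plain 32-bit size field.
    """
    boxes = []
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if size < 8:
            raise ValueError(f"unsupported box size at offset {offset}")
        boxes.append((box_type.decode("latin-1"), data[offset:offset + size]))
        offset += size
    return boxes


def split_stream(data: bytes):
    """Split an fMP4 byte stream into (init_segment, [fragment, ...])."""
    init = b""
    fragments = []
    pending_moof = b""
    for box_type, raw in top_level_boxes(data):
        if box_type in ("ftyp", "moov"):
            init += raw                      # initialization segment
        elif box_type == "moof":
            pending_moof = raw               # metadata for the next mdat
        elif box_type == "mdat" and pending_moof:
            fragments.append(pending_moof + raw)
            pending_moof = b""
    return init, fragments


# Synthetic stream: one initialization segment followed by three fragments.
init_bytes = make_box("ftyp", b"isom") + make_box("moov", b"\x00" * 32)
stream = init_bytes
for _ in range(3):
    stream += make_box("moof", b"\x00" * 24) + make_box("mdat", b"\x00" * 128)

init, fragments = split_stream(stream)
print(len(init), len(fragments))
```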
The Playback and Delivery Process
During a live stream, the encoder continuously compresses the
incoming audio and video feed. At set intervals (often corresponding to
keyframe boundaries), the packager closes the current fragment,
generates the moof and mdat pair, and
publishes it to a web server.
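The rule for closing a fragment can be sketched as a small grouping function. This is one plausible policy, assumed here for illustration rather than taken from any particular packager: a fragment is closed when a keyframe arrives and the fragment has reached a target duration. The Frame class, the millisecond timestamps, and the 25 fps feed with a keyframe every 2 seconds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    pts_ms: int       # presentation timestamp in milliseconds
    keyframe: bool    # True if the frame can be decoded on its own


def cut_fragments(frames, target_ms):
    """Group frames into fragments that each start on a keyframe.

    A fragment is closed when the next keyframe arrives and the fragment
    already spans at least target_ms.
    """
    fragments = []
    current = []
    for frame in frames:
        long_enough = bool(current) and frame.pts_ms - current[0].pts_ms >= target_ms
        if frame.keyframe and long_enough:
            fragments.append(current)   # this is where moof + mdat would be written
            current = []
        current.append(frame)
    if current:
        # In a real live stream this last group would still be open.
        fragments.append(current)
    return fragments


# Hypothetical feed: 12 seconds at 25 fps, keyframe every 50 frames (2 s).
frames = [Frame(pts_ms=i * 40, keyframe=(i % 50 == 0)) for i in range(300)]
fragments = cut_fragments(frames, target_ms=4000)

for fragment in fragments:
    print(fragment[0].pts_ms, len(fragment), fragment[0].keyframe)
```

Cutting only at keyframes means every fragment can be decoded from its first frame, which is what lets a player join the stream at any fragment boundary.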
The media player retrieves the initialization segment first to set up
the video decoder. It then fetches the newly created fragments
sequentially using a manifest or playlist file (such as an
.m3u8 index for HLS or an .mpd file for
MPEG-DASH).
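For HLS, the fetch order described above can be sketched by reading a media playlist. In HLS playlists that reference fMP4, the initialization segment is declared with the EXT-X-MAP tag and each media segment appears as a URI line. The sketch below is deliberately minimal and makes several assumptions: the playlist text and file names (init.mp4, seg120.m4s, and so on) are invented, no network requests are made, and only the tags needed for the example are interpreted.

```python
import re

# Hypothetical live playlist as the player might see it on two successive polls.
PLAYLIST_FIRST = """#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:120
#EXT-X-MAP:URI="init.mp4"
#EXTINF:4.000,
seg120.m4s
#EXTINF:4.000,
seg121.m4s
"""

PLAYLIST_SECOND = PLAYLIST_FIRST + "#EXTINF:4.000,\nseg122.m4s\n"


def parse_playlist(text):
    """Return (init_uri, [segment_uri, ...]) from a media playlist."""
    init_uri = None
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r'#EXT-X-MAP:.*URI="([^"]+)"', line)
        if match:
            init_uri = match.group(1)
        elif not line.startswith("#"):
            segments.append(line)       # any non-tag line is a segment URI
    return init_uri, segments


def plan_downloads(text, already_fetched):
    """List the URIs to fetch next: init segment first, then unseen fragments."""
    init_uri, segments = parse_playlist(text)
    plan = []
    if init_uri and init_uri not in already_fetched:
        plan.append(init_uri)
    plan.extend(uri for uri in segments if uri not in already_fetched)
    return plan


fetched = set()
first_plan = plan_downloads(PLAYLIST_FIRST, fetched)
fetched.update(first_plan)
second_plan = plan_downloads(PLAYLIST_SECOND, fetched)

print(first_plan)
print(second_plan)
```

A real player would also honor the target duration when deciding how often to reload the playlist, and would resolve each URI against the playlist URL before requesting it.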
Because each fragment contains its own local metadata
(moof), the player can decode and render the frames within
the corresponding mdat box as soon as the fragment arrives.
Beyond the initialization segment, it needs no information about
past or future fragments. This architecture allows live streams to be
delivered with low latency, high compatibility over standard HTTP web
servers, and the flexibility to switch video qualities at fragment
boundaries mid-stream.