How Fragmented MP4 Works for Live Streaming

Fragmented MPEG-4 (fMP4) is a variant of the MP4 container that enables efficient live streaming over the internet by dividing the media into small, sequential fragments, each carrying its own metadata. A traditional MP4 file requires the complete file-level metadata to be loaded before playback can begin. With fMP4, a player can decode and play fragments in near real time, as they are generated. This article explains the underlying structure of MPEG-4 fragmentation, how the container format is modified to support live delivery, and why it is essential for modern streaming protocols like HLS and MPEG-DASH.

The Problem with Traditional MP4 for Live Streaming

In a standard MP4 container, all metadata containing the timing, size, and location of every video and audio frame is stored in a single structure called the Movie Box (moov). The actual media data (the audio and video samples) is stored in the Media Data Box (mdat).

For a player to start decoding a traditional MP4, it must first read the moov box. If the moov box is placed at the end of the file (the default for many encoders), the player must either download the entire file or issue a separate request for the end of the file before playback can start. Even if the moov box is moved to the beginning of the file (a process called fast-start), the metadata must still describe every sample in the video, so it can only be written once encoding has finished. A live stream has no defined end, so the encoder cannot write a complete moov box while the broadcast is still running. This makes traditional MP4 unsuitable for live broadcasting.
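The box layout described above can be inspected with a few lines of code. The following is a minimal sketch, not a full parser: it walks the top-level boxes of a byte string using the standard size-plus-type box header and reports whether moov comes before mdat. The names iter_boxes, make_box, and moov_before_mdat are illustrative helpers invented for this example, and the sample data is synthetic rather than a playable file.

```python
import struct
from typing import Iterator, Tuple


def iter_boxes(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, start_offset, size) for each top-level box in data."""
    offset, end = 0, len(data)
    while offset + 8 <= end:
        # Every box starts with a 32-bit big-endian size and a 4-byte type.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            if offset + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = end - offset
        if size < header or offset + size > end:
            break  # truncated or malformed box: stop rather than guess
        yield box_type.decode("ascii", "replace"), offset, size
        offset += size


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a synthetic box (hypothetical helper, used only for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


def moov_before_mdat(data: bytes) -> bool:
    """Return True if the file has the fast-start layout (moov precedes mdat)."""
    order = [box_type for box_type, _, _ in iter_boxes(data)]
    return ("moov" in order and "mdat" in order
            and order.index("moov") < order.index("mdat"))


# Synthetic layouts: moov at the end versus moov at the front.
moov_last = make_box(b"ftyp", b"isom") + make_box(b"mdat", b"\x00" * 10) + make_box(b"moov")
moov_first = make_box(b"ftyp", b"isom") + make_box(b"moov") + make_box(b"mdat", b"\x00" * 10)

for label, blob in (("moov last", moov_last), ("moov first", moov_first)):
    print(label, [t for t, _, _ in iter_boxes(blob)], moov_before_mdat(blob))
```

In both layouts the moov box must describe the whole file, which is why neither arrangement helps when the stream is still being produced.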

How Fragmentation Solves the Problem

Fragmentation reorganizes the MPEG-4 container so that the metadata and media samples are interleaved throughout the stream in small, manageable pieces. Instead of one massive moov box and one massive mdat box, a fragmented MP4 (fMP4) file consists of an initialization segment followed by a continuous sequence of movie fragments.

The Structure of a Fragmented MP4

An fMP4 stream is divided into two primary component types:

Initialization Segment: This segment contains the File Type Box (ftyp) and a minimal Movie Box (moov). In an fMP4, this moov box carries no per-sample timing or location entries: its sample tables are empty, and a Movie Extends Box (mvex) inside it signals that the sample data follows in movie fragments. It defines only the track configurations, codecs, and basic properties of the stream. The player downloads this small initialization segment once at the beginning of the session so it knows how to configure the decoders.
Media Segments (Fragments): Following the initialization segment, the stream is delivered as a series of fragments, each of which depends only on the initialization segment for its decoder configuration. Each fragment consists of two boxes:
- Movie Fragment Box (moof): This acts as a mini-metadata box. It contains the timing, duration, and byte offsets for only the video and audio frames contained within that specific fragment.
- Media Data Box (mdat): This box immediately follows the moof box and contains the actual encoded video and audio samples for that specific duration (typically 2 to 6 seconds of video).
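The two-part structure above can be illustrated by splitting a byte stream at its moof boxes. The sketch below makes several simplifying assumptions: it handles only 32-bit box sizes, treats everything before the first moof as the initialization segment, and treats each fragment as running from one moof to the next. Real media segments may also carry boxes such as styp or sidx ahead of the moof, which this sketch would attach to the preceding fragment. The function names and the synthetic test stream are invented for this example.

```python
import struct
from typing import List, Tuple


def top_level_boxes(data: bytes) -> List[Tuple[str, int, int]]:
    """Return [(type, start, size)] for top-level boxes (32-bit sizes only)."""
    boxes, offset = [], 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if size < 8 or offset + size > len(data):
            break  # this sketch does not handle 64-bit or to-end-of-file sizes
        boxes.append((box_type.decode("ascii", "replace"), offset, size))
        offset += size
    return boxes


def split_fmp4(data: bytes) -> Tuple[bytes, List[bytes]]:
    """Split a stream into (initialization_segment, [fragment, ...])."""
    boxes = top_level_boxes(data)
    if not boxes:
        return b"", []
    parsed_end = boxes[-1][1] + boxes[-1][2]
    moof_starts = [start for box_type, start, _ in boxes if box_type == "moof"]
    if not moof_starts:
        return data[:parsed_end], []  # no fragments found
    init_segment = data[:moof_starts[0]]
    # Each fragment runs from its moof to the start of the next moof.
    fragment_ends = moof_starts[1:] + [parsed_end]
    fragments = [data[s:e] for s, e in zip(moof_starts, fragment_ends)]
    return init_segment, fragments


def box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a synthetic box (hypothetical helper, used only for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic stream: ftyp + moov, then two moof/mdat pairs.
stream = (box(b"ftyp", b"iso6") + box(b"moov", b"cfg")
          + box(b"moof", b"m1") + box(b"mdat", b"frames-1")
          + box(b"moof", b"m2") + box(b"mdat", b"frames-2"))

init_segment, fragments = split_fmp4(stream)
print(len(init_segment), [len(f) for f in fragments])
```

A packager works in the opposite direction, writing the initialization segment once and then emitting one moof and mdat pair per fragment, but the resulting byte layout is the same.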

The Playback and Delivery Process

During a live stream, the encoder continuously compresses the incoming audio and video feed. At set intervals (often corresponding to keyframe boundaries), the packager closes the current fragment, generates the moof and mdat pair, and publishes it to a web server.
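The fragment-closing decision can be sketched as a simple rule: start a new fragment when a keyframe arrives and the current fragment has reached its target duration. The code below is an illustrative model only. The Frame class, the cut_fragments function, and the 2-second target are assumptions made for this example, and a real packager would also write the moof and mdat boxes for each group.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Frame:
    pts: float       # presentation timestamp in seconds
    keyframe: bool   # True if the frame can be decoded on its own


def cut_fragments(frames: Iterable[Frame], target_duration: float = 2.0) -> List[List[Frame]]:
    """Group frames into fragments that each begin on a keyframe."""
    fragments: List[List[Frame]] = []
    current: List[Frame] = []
    for frame in frames:
        long_enough = bool(current) and frame.pts - current[0].pts >= target_duration
        if frame.keyframe and long_enough:
            # Close the current fragment; the keyframe opens the next one.
            fragments.append(current)
            current = []
        current.append(frame)
    if current:
        fragments.append(current)  # flush the tail when the input ends
    return fragments


# Synthetic feed: one frame every 0.5 s for 6 s, with a keyframe every 1 s.
feed = [Frame(pts=i * 0.5, keyframe=(i % 2 == 0)) for i in range(12)]
for fragment in cut_fragments(feed):
    print(fragment[0].pts, len(fragment))
```

Cutting only on keyframes means a fragment may run slightly longer than the target when the encoder's keyframe interval does not divide it evenly, which is one reason packagers and encoders are usually configured together.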

The media player retrieves the initialization segment first to set up the video decoder. It then fetches the newly created fragments sequentially using a manifest or playlist file (such as an .m3u8 index for HLS or an .mpd file for MPEG-DASH).
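For HLS, the link between the initialization segment and the fragments is expressed in the media playlist: an EXT-X-MAP tag points at the initialization segment, and each EXTINF entry points at one media segment. The sketch below generates such a playlist as text. The function name build_media_playlist and the file names init.mp4 and seg*.m4s are placeholders chosen for this example, not names required by the protocol.

```python
import math
from typing import List, Tuple


def build_media_playlist(init_uri: str,
                         segments: List[Tuple[str, float]],
                         media_sequence: int = 0) -> str:
    """Build a live HLS media playlist for fMP4 segments.

    segments is a list of (uri, duration_in_seconds) pairs.
    """
    # The target duration is an integer upper bound on segment durations.
    target = math.ceil(max((duration for _, duration in segments), default=1))
    lines = [
        "#EXTM3U",
        # EXT-X-MAP in a regular media playlist needs version 6 or higher.
        "#EXT-X-VERSION:7",
        f"#EXT-X-TARGETDURATION:{target}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
        # Points the player at the initialization segment (ftyp + moov).
        f'#EXT-X-MAP:URI="{init_uri}"',
    ]
    for uri, duration in segments:
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(uri)
    # No EXT-X-ENDLIST tag: its absence tells the player the stream is live.
    return "\n".join(lines) + "\n"


playlist = build_media_playlist(
    "init.mp4",
    [("seg100.m4s", 2.0), ("seg101.m4s", 2.0), ("seg102.m4s", 2.0)],
    media_sequence=100,
)
print(playlist)
```

As the live stream advances, the packager would regenerate this playlist with newer segments appended, older ones removed, and the media sequence number increased to match the first segment still listed.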

Because each fragment contains its own local metadata (moof), the player, once configured from the initialization segment, can decode and render the frames in the corresponding mdat box as soon as the fragment arrives, provided the fragment begins with a keyframe. It does not need an index of past or future fragments. This architecture allows live streams to be delivered with low latency over standard HTTP web servers, and it lets the player switch between video qualities at fragment boundaries without interrupting playback.