How Fragmented MP4 Works for Live Streaming

Fragmented MP4 (fMP4) is a variant of the MP4 container that makes live streaming over the internet practical by splitting a stream into small, sequential fragments, each carrying its own metadata. Unlike a traditional MP4 file, which needs the metadata for the entire video before playback can begin, fMP4 lets a player decode and play fragments as they are being generated. This article explains the underlying structure of MP4 fragmentation, how the container format is modified to support live delivery, and why it is essential for modern streaming protocols like HLS and MPEG-DASH.

The Problem with Traditional MP4 for Live Streaming

In a standard MP4 container, all metadata describing the timing, size, and location of every video and audio frame is stored in a single structure called the Movie Box (moov). The actual media data (the audio and video samples) is stored in the Media Data Box (mdat).
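
The box layout described above can be inspected with a short script. The following is a minimal sketch, not a full parser: it walks only the top-level boxes, relying on the fact that each box starts with a 32-bit big-endian size followed by a four-character type. The names iter_boxes and make_box are hypothetical helpers, and the sample bytes are synthetic placeholders rather than a playable file.

```python
import struct


def iter_boxes(data: bytes):
    """Yield (type, offset, size) for each top-level box in an MP4 byte buffer."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" field follows the type.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            raise ValueError(f"invalid box size {size} at offset {offset}")
        yield box_type.decode("ascii", "replace"), offset, size
        offset += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a synthetic box: 4-byte size, 4-byte type, then the payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic file with placeholder payloads; moov sits at the end, after mdat.
sample = (
    make_box(b"ftyp", b"isom" + bytes(4))
    + make_box(b"mdat", bytes(10))
    + make_box(b"moov", bytes(20))
)
layout = list(iter_boxes(sample))
for box_type, offset, size in layout:
    print(f"{box_type} offset={offset} size={size}")
```

Running the same function over the first few kilobytes of a real file is a quick way to see whether its moov box comes before or after mdat.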

For a player to start decoding a traditional MP4, it must first read the moov box. If the moov box is placed at the end of the file (the default for many encoders), the player typically has to download the entire file, or make a separate request for the end of the file, before playback starts. Even if the moov box is moved to the beginning of the file (a process called fast-start), it must still describe every sample in the video, so it can only be written once encoding has finished. During a live stream, the video has no defined end, meaning the encoder can never write a final moov box. This makes traditional MP4 unsuitable for live broadcasting.

How Fragmentation Solves the Problem

Fragmentation reorganizes the MPEG-4 container so that the metadata and media samples are interleaved throughout the stream in small, manageable pieces. Instead of one massive moov box and one massive mdat box, a fragmented MP4 (fMP4) file consists of an initialization segment followed by a continuous sequence of movie fragments.

The Structure of a Fragmented MP4

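An fMP4 stream is divided into two primary component types:

Initialization segment: sent once at the start of the stream. It contains a File Type Box (ftyp) and a Movie Box (moov) that carries the track and codec configuration the decoder needs, but it describes no media samples itself.

Media fragments: each fragment pairs a Movie Fragment Box (moof), which holds the timing, size, and location of only the samples in that fragment, with a Media Data Box (mdat) that holds those samples.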
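
As a rough illustration of this two-part layout, the sketch below takes a list of top-level box types that has already been extracted from a stream and separates the one-time initialization boxes (ftyp and moov) from the repeating fragments, where each fragment is a moof (movie fragment) box followed by its mdat box. The function name split_layout is hypothetical, and the sketch assumes only the four box types discussed here; real streams can carry additional boxes that it does not handle.

```python
def split_layout(box_types):
    """Split a list of top-level box types into (init_boxes, fragments).

    Assumes the plain layout ftyp, moov, then repeated moof + mdat pairs.
    """
    init_boxes = []
    fragments = []
    current = None  # boxes of the fragment currently being collected
    for box_type in box_types:
        if box_type == "moof":
            # Every moof box opens a new fragment.
            current = [box_type]
            fragments.append(current)
        elif current is None:
            # Anything before the first moof belongs to the initialization segment.
            init_boxes.append(box_type)
        else:
            current.append(box_type)
    return init_boxes, fragments


init_boxes, fragments = split_layout(
    ["ftyp", "moov", "moof", "mdat", "moof", "mdat", "moof", "mdat"]
)
print("init:", init_boxes)
print("fragments:", fragments)
```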

The Playback and Delivery Process

During a live stream, the encoder continuously compresses the incoming audio and video feed. At set intervals (often corresponding to keyframe boundaries), the packager closes the current fragment, generates the moof and mdat pair, and publishes it to a web server.
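
The cutting rule can be sketched in a few lines. This is an assumption-laden simplification rather than how any particular packager works: frames are modeled as (timestamp in seconds, is_keyframe) tuples, cut_fragments and target_duration are hypothetical names, and a fragment is closed at the first keyframe that arrives once the target duration has elapsed.

```python
def cut_fragments(frames, target_duration):
    """Group (timestamp, is_keyframe) tuples into fragments that start on keyframes."""
    fragments = []
    current = []
    start = None  # timestamp of the first frame in the current fragment
    for timestamp, is_keyframe in frames:
        # Close the open fragment only at a keyframe, once it is long enough.
        if current and is_keyframe and timestamp - start >= target_duration:
            fragments.append(current)
            current = []
        if not current:
            start = timestamp
        current.append((timestamp, is_keyframe))
    if current:
        # At end of stream, flush whatever is left.
        fragments.append(current)
    return fragments


# Hypothetical feed: one frame every 0.5 s, a keyframe every fourth frame.
frames = [(i * 0.5, i % 4 == 0) for i in range(12)]
fragments = cut_fragments(frames, target_duration=2.0)
for fragment in fragments:
    print(f"start={fragment[0][0]} frames={len(fragment)}")
```

In a real live pipeline the final flush would only happen when the broadcast ends; until then each closed group would be written out as one moof and mdat pair.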

The media player retrieves the initialization segment first to set up the video decoder. It then fetches the newly created fragments sequentially using a manifest or playlist file (such as an .m3u8 index for HLS or an .mpd file for MPEG-DASH).
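
For HLS, the fetch order can be read straight from the playlist: the EXT-X-MAP tag names the initialization segment, and each non-comment line names a media fragment. The sketch below parses a small made-up playlist; the file names, sequence numbers, and the fetch_order function are illustrative assumptions, and the parser ignores most of what a real player would need to handle.

```python
import re

# Hypothetical live playlist; URIs and numbers are invented for illustration.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:100
#EXT-X-MAP:URI="init.mp4"
#EXTINF:2.000,
seg100.m4s
#EXTINF:2.000,
seg101.m4s
"""


def fetch_order(playlist_text):
    """Return the URIs in download order: initialization segment, then fragments."""
    init_uri = None
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-MAP:"):
            # The URI attribute of EXT-X-MAP points at the initialization segment.
            match = re.search(r'URI="([^"]*)"', line)
            if match:
                init_uri = match.group(1)
        elif line and not line.startswith("#"):
            # Lines that are not tags or comments are media segment URIs.
            segments.append(line)
    if init_uri is None:
        raise ValueError("playlist has no EXT-X-MAP tag")
    return [init_uri] + segments


order = fetch_order(SAMPLE_PLAYLIST)
print(order)
```

A live player would reload the playlist periodically and request only the segment URIs it has not seen yet.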

Because each fragment contains its own local metadata (moof), the player can decode and render the frames within the corresponding mdat box using only the initialization segment, without needing the metadata of past or future fragments, provided the fragment starts on a keyframe. This architecture allows live streams to be delivered with low latency, over standard HTTP web servers, and with the flexibility to switch video quality at fragment boundaries mid-stream.