How to Save Wget Downloads to a WARC File

This article provides a straightforward guide on how to use the wget command-line utility to archive web pages and their dependencies directly into a Web ARChive (WARC) file. You will learn the specific flags required to enable WARC logging, how to define the archive’s filename, and best practices for creating digital archives that comply with international preservation standards.

Understanding the WARC Format and Wget

The WARC (Web ARChive) format, standardized as ISO 28500, is the standard container for web preservation and is used extensively by institutions such as the Internet Archive and the Library of Congress. Unlike a plain HTML download or a ZIP file, a WARC file stores the downloaded content together with the original HTTP request and response headers and capture metadata (such as the target URL and timestamp) in a single file.

wget has had built-in support for generating these archives since version 1.14, making it a convenient tool for capturing websites while preserving their historical and technical context.

The Basic Command for WARC Logging

To instruct wget to record its download session in a WARC file, use the --warc-file option followed by the base name of the archive, without a file extension.

wget --warc-file="my-archive" https://example.com

When you run this command, wget will perform two main actions:

It downloads the target webpage (https://example.com) to your local directory as it normally would.
It simultaneously creates a file named my-archive.warc.gz containing the full capture of the network session.

Note: wget appends the extension itself. By default it compresses the WARC file with gzip, which is why the file ends in .warc.gz. Passing --no-warc-compression produces an uncompressed my-archive.warc instead.
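To confirm what ended up in the archive, you can summarize the record types it contains. The following is a minimal sketch rather than a full WARC parser: the function name warc_record_types is made up for this example, and it simply counts header lines that start with "WARC-Type:", so a payload that happens to contain such a line would be counted too.

```shell
#!/bin/sh
# Summarize the record types stored in a gzip-compressed WARC file.
# Usage: warc_record_types my-archive.warc.gz
# (warc_record_types is a hypothetical helper name, not part of wget.)
warc_record_types() {
    # gzip -dc decompresses to stdout and handles multi-member gzip files.
    # grep -a treats the binary payloads as text instead of skipping them.
    # WARC headers end in CRLF, so tr strips the carriage returns.
    gzip -dc "$1" \
        | grep -a '^WARC-Type:' \
        | tr -d '\r' \
        | sort \
        | uniq -c
}
```

Running warc_record_types my-archive.warc.gz after the command above should list at least a warcinfo record plus request and response records for the page. The exact set and counts depend on the site and on your wget version.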

Recommended Flags for Complete Web Archiving

Downloading a single HTML page is rarely enough to archive a website faithfully. To create a self-contained WARC file, combine the WARC flag with options that make wget download page requisites (such as images, scripts, and stylesheets), including those hosted on other domains. Following links to further pages additionally requires --recursive, which the command below does not use, so it captures one page and its assets.

Here is a comprehensive command for standard web archiving:

wget --page-requisites --span-hosts --convert-links \
     --warc-file="comprehensive-archive" https://example.com

--page-requisites (or -p): Tells wget to download all the assets needed to display the HTML page correctly, such as images, CSS, and JavaScript.
--span-hosts (or -H): Allows wget to visit external hosts. This is crucial for modern websites that load assets (like fonts or jQuery libraries) from external Content Delivery Networks (CDNs).
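--convert-links (or -k): After the download finishes, rewrites the links in the locally saved HTML files so that they point to the local copies, which makes the loose files browsable offline. This affects only the files on disk; the WARC file still records the responses as the server sent them.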
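The command above captures one page and its assets. To archive several pages of a site, a recursive crawl could be sketched as follows. Everything specific here is an assumption for illustration: the helper name archive_site, the DRY_RUN switch, the default depth of 2, and the one-second wait between requests. The sketch deliberately leaves out --span-hosts, because combining it with --recursive lets the crawl wander onto other sites; the trade-off is that assets served from other hosts are not fetched.

```shell
#!/bin/sh
# Hypothetical helper: crawl a site to a limited depth and record it in a WARC.
# Usage: archive_site URL ARCHIVE_NAME [DEPTH]
# Set DRY_RUN=1 to print the wget command instead of running it.
archive_site() {
    # POSIX sh has no "local", so these variables are global to the script.
    url=$1
    name=$2
    depth=${3:-2}

    # Rebuild the positional parameters as the wget argument list.
    #   --recursive/--level  follow links up to the given depth
    #   --no-parent          never ascend above the starting directory
    #   --wait=1             pause one second between requests
    set -- \
        --recursive --level="$depth" --no-parent \
        --page-requisites \
        --wait=1 \
        --warc-file="$name" \
        "$url"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "wget $*"
    else
        wget "$@"
    fi
}
```

For example, DRY_RUN=1 archive_site https://example.com site-crawl 1 prints the command that would be run, which is a cheap way to review the flags before starting a long crawl.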

Disabling the Local Copy (Pure Archiving)

By default, wget writes every downloaded asset to disk as an individual file in addition to recording it in the compressed WARC file. If your only goal is the single WARC file, without thousands of loose files cluttering your local directory, use the --delete-after flag:

wget --page-requisites --span-hosts --delete-after \
     --warc-file="pure-archive" https://example.com

With --delete-after, wget still downloads each file and records it in pure-archive.warc.gz, but it deletes the local copy once that file has been downloaded instead of leaving it on disk. Because no local copies remain, --convert-links is ignored when --delete-after is given, which is why it is omitted from this command.
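Since the WARC file is now the only copy of the capture, it is worth checking it before you rely on it. The sketch below assumes a gzip-compressed WARC and uses a made-up helper name, verify_warc. It tests the gzip stream and then lists the distinct URLs found in the WARC-Target-URI headers. Some wget versions may wrap that URI in angle brackets, so the sketch strips them if present.

```shell
#!/bin/sh
# Check that a WARC written by wget is readable and list the URLs it captured.
# Usage: verify_warc pure-archive.warc.gz
# (verify_warc is a hypothetical helper name, not part of wget.)
verify_warc() {
    # gzip -t tests every compressed member without writing any output.
    gzip -t "$1" || { echo "corrupt or truncated: $1" >&2; return 1; }

    # Request and response records share a target URI, so sort -u de-duplicates.
    # tr removes the carriage returns from the CRLF-terminated header lines.
    gzip -dc "$1" \
        | grep -a '^WARC-Target-URI:' \
        | tr -d '\r' \
        | sed -e 's/^WARC-Target-URI: *//' -e 's/^<//' -e 's/>$//' \
        | sort -u
}
```

If the list is missing assets you expected, such as stylesheets or images, the capture was probably incomplete and is worth repeating before the original site changes.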