How to Save Wget Downloads to a WARC File?
This article provides a straightforward guide on how to use the
wget command-line utility to archive web pages and their
dependencies directly into a Web ARChive (WARC) file. You will learn the
specific flags required to enable WARC logging, how to define the
archive’s filename, and best practices for creating digital archives
that comply with international preservation standards.
Understanding the WARC Format and Wget
The WARC (Web ARChive) format, standardized as ISO 28500, is widely used for digital preservation by institutions like the Internet Archive and the Library of Congress. Unlike a standard HTML download or a simple ZIP file, a WARC file combines the downloaded content, the original HTTP request and response headers, and capture metadata into a single, standardized file.
wget has built-in support for generating these archives,
making it an excellent tool for scraping websites while maintaining
their precise historical and technical context.
The Basic Command for WARC Logging
To instruct wget to record its download process into a
WARC file, you need to use the --warc-file option followed
by your desired filename.
wget --warc-file="my-archive" https://example.com
When you run this command, wget will perform two main actions:
- It downloads the target webpage (https://example.com) to your local directory as it normally would.
- It simultaneously creates a file named my-archive.warc.gz containing the full capture of the network session.
Note: wget automatically compresses the resulting WARC file using Gzip, which is why .warc.gz is appended to the name you supply. To write an uncompressed .warc file instead, add the --no-warc-compression option.
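To confirm that a capture contains what you expect, you can list the record types stored in the archive. The following is a minimal sketch, not part of wget: the function name warc_record_types is hypothetical, and it assumes that lines beginning with "WARC-Type:" occur only in record headers, which can miscount if a downloaded payload happens to contain such a line.

```shell
# Count the record types in a gzip-compressed WARC file.
# warc_record_types is a hypothetical helper name, not a wget feature.
warc_record_types() {
    # gzip -dc decompresses to stdout and copes with concatenated gzip members.
    # grep -a treats binary payloads as text so header lines are still matched.
    # tr removes the carriage return from the CRLF line endings of WARC headers.
    gzip -dc -- "$1" | grep -a '^WARC-Type:' | tr -d '\r' | sort | uniq -c
}
```

Running warc_record_types my-archive.warc.gz prints one line per record type with its count. A wget capture normally includes types such as warcinfo, request, and response.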
Recommended Flags for Complete Web Archiving
Simply downloading a single HTML page usually isn’t enough to
faithfully archive a website. To create a robust, self-contained WARC
file, you should combine the WARC flag with options that force
wget to download page requisites (like images, scripts,
and stylesheets), including those hosted on other domains. Following
links to further pages additionally requires the --recursive option.
Here is a comprehensive command for standard web archiving:
wget --page-requisites --span-hosts --convert-links \
--warc-file="comprehensive-archive" https://example.com
- --page-requisites (or -p): Tells wget to download all the assets needed to display the HTML page correctly, such as images, CSS, and JavaScript.
- --span-hosts (or -H): Allows wget to visit external hosts. This is crucial for modern websites that load assets (like fonts or jQuery libraries) from external Content Delivery Networks (CDNs).
- --convert-links (or -k): Modifies the links in the downloaded HTML files to point to local files, so the local copy can be browsed offline. This affects only the files on disk; the WARC file keeps the responses as the server sent them.
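The command above captures a single page and its assets. If you also want wget to follow links, one possible variant is sketched below. The wrapper name archive_site and the choice of one link level and a one-second delay are assumptions for illustration; --recursive, --level, and --wait are standard wget options.

```shell
# Hypothetical wrapper: archive a start page plus the pages it links to
# into NAME.warc.gz. Usage: archive_site NAME URL
archive_site() {
    # --recursive with --level=1 follows links found on the start page only.
    # --wait=1 pauses one second between requests to go easy on the server.
    # --span-hosts is left out on purpose so recursion stays on the start host;
    # the trade-off is that assets served from other hosts are not captured.
    wget --recursive --level=1 --page-requisites --wait=1 \
         --warc-file="$1" "$2"
}
```

For example, archive_site site-archive https://example.com would write site-archive.warc.gz. Raise --level cautiously, since each extra level can multiply the number of requests.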
Disabling the Local Copy (Pure Archiving)
By default, wget writes every downloaded asset to your
hard drive as individual files in addition to writing them into
the compressed WARC file. If your only goal is to generate the single
WARC file without cluttering your local directory with thousands of
loose files, you can have wget remove those files for you.
The native way to handle this in standard archiving
workflows is to use the --delete-after flag:
wget --page-requisites --span-hosts --delete-after \
--warc-file="pure-archive" https://example.com
With --delete-after, wget will still download the files to process them
and write them into the pure-archive.warc.gz file, but it deletes
each loose local file from your drive as soon as that file has been
downloaded and processed.
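Because --delete-after may still leave traces such as empty directories, one way to keep the working directory clean is to run wget inside a throwaway directory and write only the WARC file back. This is a sketch under assumptions: archive_only is a hypothetical helper name, and it relies on --warc-file accepting a path prefix rather than a bare name.

```shell
# Hypothetical helper: leave only NAME.warc.gz in the current directory.
# Usage: archive_only NAME URL
archive_only() {
    name=$1
    url=$2
    out_dir=$PWD
    scratch=$(mktemp -d) || return 1
    # Run wget inside a scratch directory; the subshell keeps the cd local.
    (
        cd "$scratch" &&
        wget --page-requisites --span-hosts --delete-after \
             --warc-file="$out_dir/$name" "$url"
    )
    status=$?
    # Remove anything --delete-after left behind, such as empty directories.
    rm -rf -- "$scratch"
    return "$status"
}
```

Calling archive_only pure-archive https://example.com should leave pure-archive.warc.gz in the directory you started from. The function returns wget's exit status, which can be non-zero even for a usable capture, for instance when a single asset returns an error.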