How to Recursively Download a Website with Wget?

This article is a practical guide to using the wget command-line utility to download an entire website for offline viewing. You will learn the essential terminal commands, the specific flags required for recursive downloading, and how to mirror a site without overloading the host server.

Understanding Recursive Downloading

When you download a website recursively, wget follows the links on the initial page and downloads those linked pages as well. This process repeats up to a maximum depth (five levels by default when you use --recursive on its own), allowing you to fetch the structure of a site, including its HTML pages, images, stylesheets, and scripts.
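
The depth limit described above can be set explicitly with the --level flag. The following is a minimal sketch, not a definitive script: fetch_to_depth is a hypothetical helper name, https://example.com is a placeholder URL, and the WGET variable is an assumed convention that lets you substitute another command for previewing.

```shell
#!/bin/sh
# fetch_to_depth is a hypothetical helper name, not part of wget.
# Usage: fetch_to_depth URL [DEPTH]
fetch_to_depth() {
    url=$1
    depth=${2:-2}   # this sketch defaults to 2; wget itself defaults to 5
    # ${WGET:-wget} runs wget unless the WGET variable names another command,
    # for example WGET="echo wget" to print the command instead of running it.
    ${WGET:-wget} --recursive --level="$depth" "$url"
}
```

Calling fetch_to_depth https://example.com 3 would follow links up to three levels away from the start page. A small depth is a reasonable way to test a command before committing to a full mirror.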

The Standard Website Mirroring Command

The simplest way to download a complete website with wget is to combine the built-in --mirror flag with a few options that make the local copy browsable offline. The command looks like this:

wget --mirror --page-requisites --adjust-extension --convert-links --no-parent https://example.com

Breakdown of the Command Flags

Each flag in the command serves a specific purpose to ensure the downloaded website functions correctly on your local machine:

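--mirror (-m): Turns on options suitable for mirroring: recursion with unlimited depth, time-stamping (so a repeat run re-downloads only files that are newer on the server), and retention of FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.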
--page-requisites (-p): Tells wget to download all the elements necessary to display the HTML page correctly, such as images, sounds, and cascading stylesheets (CSS).
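--adjust-extension (-E): If a downloaded HTML document has a URL that does not end in .html or .htm (like a dynamically generated PHP page), this flag appends .html to the local filename so that it opens properly in a web browser. Stylesheets get a .css suffix in the same way.
--convert-links (-k): After the download is complete, this rewrites the links in the downloaded documents so that you can navigate the site locally. Links to files that were downloaded become relative links to the local copies, while links to files that were not downloaded are rewritten to point at their full original URLs.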
--no-parent (-np): Restricts the download to the specified directory and its subdirectories. This prevents wget from following links to parent directories or escaping the intended section of the site.
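
The flag breakdown above can be restated with short options, which also makes visible what --mirror expands to. This is a sketch under stated assumptions: mirror_site is a hypothetical wrapper name, and the WGET variable is an assumed convention for swapping in a preview command; the flags themselves are standard wget options.

```shell
#!/bin/sh
# mirror_site is a hypothetical wrapper name, not part of wget.
# Usage: mirror_site URL
mirror_site() {
    # -r -N -l inf --no-remove-listing   what --mirror currently expands to
    # -p   --page-requisites             -E   --adjust-extension
    # -k   --convert-links               -np  --no-parent
    # ${WGET:-wget} runs wget unless WGET is set to another command.
    ${WGET:-wget} -r -N -l inf --no-remove-listing -p -E -k -np "$1"
}
```

Running mirror_site https://example.com should behave like the long-form command shown earlier. The long options are usually easier to read in scripts, while the short ones are convenient at the prompt.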

Being a Good Netizen: Adding Delays

Downloading a website too quickly can strain the target server’s resources. To reduce the chance of your IP address being blocked and to practice good web citizenship, you should introduce a delay between your requests and cap the download rate.

wget --mirror --page-requisites --adjust-extension --convert-links --no-parent --wait=2 --limit-rate=100k https://example.com

--wait=2: Makes wget pause for 2 seconds between retrievals.
--limit-rate=100k: Restricts the download speed to 100 kilobytes per second, reducing bandwidth consumption on both your end and the server’s end.
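
The throttled command above can be wrapped in a small function so the delay and rate are easy to adjust. This is a rough sketch, not a recommended configuration: polite_mirror, WAIT_SECONDS, RATE_LIMIT, and WGET are hypothetical names introduced here. It also adds --random-wait, a standard wget option not used in the command above, which varies each pause between 0.5 and 1.5 times the --wait value so requests are less regular.

```shell
#!/bin/sh
# polite_mirror is a hypothetical helper name, not part of wget.
# WAIT_SECONDS and RATE_LIMIT are illustrative variables; their defaults
# match the values used in the command above.
# Usage: polite_mirror URL
polite_mirror() {
    url=$1
    wait_seconds=${WAIT_SECONDS:-2}
    rate_limit=${RATE_LIMIT:-100k}
    # ${WGET:-wget} runs wget unless WGET is set to another command.
    ${WGET:-wget} --mirror --page-requisites --adjust-extension \
        --convert-links --no-parent \
        --wait="$wait_seconds" --random-wait \
        --limit-rate="$rate_limit" "$url"
}
```

For example, setting WAIT_SECONDS=5 before calling polite_mirror https://example.com would lengthen the average pause to about five seconds. Appropriate values depend on the site, so treat the defaults as a starting point.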