How to Recursively Download a Website with Wget?
This article provides a practical guide on how to use the
wget command-line utility to download an entire website for
offline viewing. You will learn the essential terminal commands, the
specific flags required for recursive downloading, and how to safely
mirror a site without overloading the host server.
Understanding Recursive Downloading
When you download a website recursively, wget follows
the links on the initial page and downloads the linked pages as
well. This process repeats up to a maximum depth (five levels by default,
adjustable with --level), allowing you to fetch the structure of a site,
including its HTML pages, images, stylesheets, and scripts.
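To make the depth limit concrete, here is a minimal sketch of a depth-limited crawl. The helper name build_recursive_cmd and the example URL are placeholders invented for illustration; --recursive and --level are standard wget options. The helper only prints the command so you can review it before running anything.

```shell
#!/bin/sh
# Hypothetical helper: prints a wget command for a depth-limited recursive crawl.
# It does not contact the network; it only builds the command line.
build_recursive_cmd() {
    url=$1
    depth=${2:-5}   # fall back to 5, wget's default maximum recursion depth
    # --recursive follows links; --level caps how many link hops wget takes
    # away from the starting page.
    printf 'wget --recursive --level=%s %s\n' "$depth" "$url"
}

# Fetch the start page plus anything reachable within two link hops.
build_recursive_cmd https://example.com/ 2
```

A small depth such as 1 or 2 is a reasonable way to test a crawl on an unfamiliar site before committing to a full mirror.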
The Standard Website Mirroring Command
The most direct way to download a complete website using
wget is to use the built-in mirroring flag. The
command looks like this:
wget --mirror --page-requisites --adjust-extension --convert-links --no-parent https://example.com
Breakdown of the Command Flags
Each flag in the command serves a specific purpose to ensure the downloaded website functions correctly on your local machine:
--mirror (-m): Turns on options suitable for mirroring. It is equivalent to -r -N -l inf --no-remove-listing: recursion with unlimited depth, time-stamping so that re-runs skip files that have not changed on the server, and retention of FTP directory listings.
--page-requisites (-p): Tells wget to download all the files needed to display each HTML page correctly, such as images, sounds, and cascading stylesheets (CSS).
--adjust-extension (-E): If a page is served as HTML but its URL does not end in .html or .htm (like a dynamically generated PHP page), this flag appends .html to the local filename so that it opens properly in a web browser.
--convert-links (-k): After the download is complete, this rewrites the links in the downloaded documents for local viewing. Links to files that were downloaded become relative links, so you can navigate the site without an internet connection.
--no-parent (-np): Restricts the download to the specified directory and its subdirectories. This prevents wget from ascending to parent directories and escaping the intended section of the site.
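As a quick reference for the short and long forms listed above, the following sketch maps each short flag to its long name. The function name expand_flag is made up for this example; the flag pairs themselves come from the list above.

```shell
#!/bin/sh
# Hypothetical lookup: translate a short wget flag from this article
# into its long form. Unknown flags are passed through unchanged.
expand_flag() {
    case $1 in
        -m)  echo --mirror ;;
        -p)  echo --page-requisites ;;
        -E)  echo --adjust-extension ;;
        -k)  echo --convert-links ;;
        -np) echo --no-parent ;;
        *)   echo "$1" ;;
    esac
}

# Print a two-column table of short and long forms.
for flag in -m -p -E -k -np; do
    printf '%-4s %s\n' "$flag" "$(expand_flag "$flag")"
done
```

In practice this means the mirroring command can also be typed as wget -m -p -E -k -np followed by the URL, although the long forms are easier to read in scripts.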
Being a Good Netizen: Adding Delays
Downloading a website too quickly can strain the target server’s resources. To avoid having your IP address blocked and to practice good web citizenship, you should introduce a delay between your requests and cap your download speed:
wget --mirror --page-requisites --adjust-extension --convert-links --no-parent --wait=2 --limit-rate=100k https://example.com
--wait=2: Forces wget to pause for 2 seconds between each file download.
--limit-rate=100k: Restricts the download speed to 100 kilobytes per second, reducing bandwidth consumption on both your end and the server’s end.
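The throttled command can be wrapped in a small script so the delay and rate are easy to adjust. This is only a sketch: the function names polite_mirror_cmd and min_wait_seconds, and the file count used in the estimate, are assumptions for illustration. The script prints the command rather than running it, so nothing is downloaded until you choose to execute the output.

```shell
#!/bin/sh
# Hypothetical wrapper: prints the throttled mirroring command from this article.
# Arguments: URL, seconds to wait (default 2), rate limit (default 100k).
polite_mirror_cmd() {
    url=$1
    wait_seconds=${2:-2}
    rate=${3:-100k}
    printf 'wget --mirror --page-requisites --adjust-extension --convert-links --no-parent --wait=%s --limit-rate=%s %s\n' \
        "$wait_seconds" "$rate" "$url"
}

# Rough lower bound on time spent pausing: wget waits between retrievals,
# so n files incur about n-1 pauses. Transfer time is not included.
min_wait_seconds() {
    files=$1
    wait_seconds=$2
    echo $(( (files - 1) * wait_seconds ))
}

# Review the command before running it.
polite_mirror_cmd https://example.com

# Estimated pause time for an assumed 500-file site with --wait=2.
min_wait_seconds 500 2

# To execute the printed command, pipe it to a shell:
# polite_mirror_cmd https://example.com 5 50k | sh
```

The estimate is a reminder that polite settings make large mirrors slow, so it can be worth checking the size of a site before choosing the delay.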