How to Make Wget Ignore Robots.txt

This article is a quick, step-by-step guide to making the wget command-line utility bypass the robots.txt restrictions of a target website. wget honors these crawling rules by default when it downloads recursively, but you can override that behavior with a single command-line option. Below, we cover the exact command to use, how it works, and the ethical considerations to keep in mind before bypassing these restrictions.

The Short Answer: The -e robots=off Flag

By default, wget honors robots.txt during recursive downloads: it fetches the site’s robots.txt file and skips every path the file disallows, which can leave you with little more than the starting page. A plain single-URL download does not consult robots.txt at all. To make wget ignore these rules, use the execute option (-e, or --execute), which runs a .wgetrc-style command, to turn the robots setting off.

The standard syntax for the command is:

wget -e robots=off <URL>
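
Because -e simply runs a startup-file command, the same setting can also live in a wgetrc file. The sketch below writes one and prints it back; the file location is a hypothetical choice for illustration, not something the article prescribes.

```shell
#!/bin/sh
# Hypothetical location for a per-project wget startup file.
rc_file="${TMPDIR:-/tmp}/wgetrc-robots-off"

# The same setting that -e robots=off passes on the command line.
printf 'robots = off\n' > "$rc_file"

# Show what was written.
cat "$rc_file"

# To use it for one run, point wget at the file through the WGETRC variable:
#   WGETRC="$rc_file" wget -r <URL>
```

Keeping the setting in a separate file rather than in ~/.wgetrc means robots.txt is still honored for every other wget run.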

Mirroring a Full Site While Ignoring Robots.txt

If your goal is to download or mirror an entire website for offline viewing while bypassing the robots restrictions, you will typically combine robots=off with the recursive option (-r). The mirror option (-m) is a shorthand that also turns on recursion, with unlimited depth and timestamping.

A commonly used command looks like this:

wget -r -l 0 -e robots=off <URL>

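In this command:

-r turns on recursive retrieval, so wget follows links from the starting page.
-l 0 removes the recursion depth limit (the default depth is 5).
-e robots=off tells wget not to apply the site’s robots.txt rules.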
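
For offline viewing, the recursive command is often extended with a few more options. The following is a minimal sketch rather than a recommended recipe: build_mirror_cmd is a hypothetical helper name, the URL is a placeholder, and the -k, -p, and -np options are additions that the command above does not include. The helper only prints the command so it can be reviewed before anything is downloaded.

```shell
#!/bin/sh
# Hypothetical helper: prints a wget command instead of running it.
build_mirror_cmd() {
    url="$1"
    # -r -l 0        recurse with no depth limit
    # -e robots=off  do not apply robots.txt rules
    # -k             convert links so saved pages work when opened locally
    # -p             also fetch page requisites such as CSS and images
    # -np            never ascend above the starting directory
    echo "wget -r -l 0 -e robots=off -k -p -np $url"
}

# Placeholder URL; replace it with the site you are allowed to mirror.
build_mirror_cmd "https://example.com/docs/"
```

Restricting the crawl with -np keeps an unlimited-depth download from spreading across the whole host when only one section is needed.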

Important Ethics and Best Practices

While technical workarounds exist, bypassing a website’s robots.txt should be done with caution. Website owners use this file to protect their server bandwidth and prevent automated bots from crashing their site.

If you must bypass the restrictions, consider using the --wait flag (e.g., --wait=2) to add a delay between your requests. This prevents your script from overwhelming the host server and reduces the likelihood of your IP address getting temporarily or permanently blocked.
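
A throttled version of the recursive command might look like the sketch below. It assumes a 2-second base delay as in the example above; polite_wget_cmd is a hypothetical helper name, the URL is a placeholder, and --random-wait and --limit-rate are extra wget options added here for illustration. The helper prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: prints a throttled wget command for review.
polite_wget_cmd() {
    url="$1"
    wait_seconds="${2:-2}"   # base delay between requests, default 2 seconds
    # --wait         pause between successive requests
    # --random-wait  vary that pause so requests are less regular
    # --limit-rate   cap download bandwidth (200k is an arbitrary example value)
    echo "wget -r -l 0 -e robots=off --wait=$wait_seconds --random-wait --limit-rate=200k $url"
}

# Placeholder URL with the default 2-second delay.
polite_wget_cmd "https://example.com/docs/"
```

Passing a second argument, for example polite_wget_cmd "https://example.com/docs/" 5, raises the base delay for slower or smaller sites.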