How to Limit Recursion Depth in Wget?

When downloading websites recursively with the wget command-line tool, it is easy to follow links much further than intended and download far more data than you wanted. This guide shows how to limit the recursion depth to a specific number of levels, explains the command-line flags involved, and covers best practices for keeping a recursive download under control.

The Short Answer: Using the -l Flag

The most direct way to control the depth of a recursive download in wget is by using the --level option, or its short-form counterpart, -l. By default, when you trigger a recursive download, wget sets a maximum depth limit of 5 levels.

To specify your own limit, use the following syntax:

wget -r -l <depth_number> <URL>

For example, if you only want to download the main page and the pages directly linked from it, you would set the depth to 1:

wget -r -l 1 https://example.com
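
If the depth comes from a script argument or user input, it can help to validate it before it reaches wget. The following is a minimal sketch, not part of wget itself: wget_depth_cmd is a hypothetical helper name, and it only prints the command so you can review it before running it.

```shell
#!/bin/sh
# Hypothetical helper (not a wget feature): print a wget command
# for a given depth and URL instead of running it.
wget_depth_cmd() {
    depth=$1
    url=$2
    # Accept the literal "inf" or a non-negative integer; reject anything else.
    case $depth in
        inf) ;;
        ''|*[!0-9]*)
            echo "depth must be a non-negative integer or 'inf'" >&2
            return 1
            ;;
    esac
    printf 'wget -r -l %s %s\n' "$depth" "$url"
}

# Print the command for a depth-1 download of the example site.
wget_depth_cmd 1 https://example.com
```

Once the printed command looks right, you can run it directly in your shell.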

Understanding Key Parameters

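To gain full control over your download, it helps to understand how the two main flags interact: -r (--recursive) turns on recursive retrieval, and -l (--level) sets the maximum depth wget will follow from the starting URL. The -l flag only has an effect when recursion is enabled.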

Practical Examples of Depth Control

Different scraping scenarios require different levels of depth. Here is how to adjust your command based on common needs:

Depth Level   Command Example                    Use Case
Depth 1       wget -r -l 1 https://example.com   Grabbing a landing page and its immediate sub-pages only.
Depth 3       wget -r -l 3 https://example.com   Downloading a specific multi-tiered documentation section.
Infinite      wget -r -l 0 https://example.com   Full website mirroring (ensure you have permission and space).
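
For the infinite-depth case, the "permission and space" caveat can be partly addressed with wget's -Q (--quota) and -w (--wait) options, which cap the total download size and pause between requests. The sketch below is one possible approach under assumed values: mirror_cmd is a hypothetical helper, and the 500m quota and 1-second wait are illustrative defaults, not recommendations from wget. It prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print an unlimited-depth wget command with a
# download quota (-Q) and a pause between requests (-w).
mirror_cmd() {
    url=$1
    quota=${2:-500m}   # assumed size cap; wget accepts k and m suffixes
    wait_s=${3:-1}     # assumed pause between requests, in seconds
    printf 'wget -r -l 0 -Q %s -w %s %s\n' "$quota" "$wait_s" "$url"
}

# Print the command using the assumed defaults.
mirror_cmd https://example.com
```

Adjust the quota and wait values to your own disk space and to what the site owner allows.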

Important Safeguards to Pair with Recursion Limits

Limiting the depth is a good first step, but a recursive download can still fetch more than you intended if wget wanders outside the part of the site you care about, or onto other hosts once host spanning is enabled.

To keep your recursive download safe and efficient, consider pairing the depth limit with the -np (no-parent) flag. This prevents wget from traveling up to parent directories, ensuring you only download files below the specific directory level you targeted.

wget -r -l 2 -np https://example.com/blog/
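
Note the trailing slash in the URL above. Without it, wget generally treats the last path component as a file, so the directory that -np protects would be the one above it. A minimal sketch of normalizing the URL first, assuming the URL really points at a directory (ensure_trailing_slash is a hypothetical helper name, and the command is printed rather than run):

```shell
#!/bin/sh
# Hypothetical helper: append a trailing slash if the URL lacks one.
# Assumes the URL names a directory, not a file or a query string.
ensure_trailing_slash() {
    case $1 in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

url=$(ensure_trailing_slash "https://example.com/blog")
# Print the resulting command for review.
printf 'wget -r -l 2 -np %s\n' "$url"
```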

Additionally, use the -H (span hosts) flag with care. By default, wget stays on the host of the starting URL and does not follow links to other hosts. If you turn on host spanning with -H, a strict -l depth limit becomes vital, because every external link can lead to another site with links of its own, and the crawl can grow far beyond the original website.
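
One way to keep -H contained is to combine it with a finite depth and wget's -D (--domains) option, which restricts the domains wget will follow. The sketch below is illustrative only: span_hosts_cmd is a hypothetical helper, cdn.example.com is an assumed second domain, and the command is printed rather than run. The helper refuses unlimited depth when spanning hosts.

```shell
#!/bin/sh
# Hypothetical helper: print a host-spanning wget command, but only
# when the depth is a finite positive integer.
span_hosts_cmd() {
    depth=$1
    domains=$2
    url=$3
    # Reject 0 and inf (both mean unlimited), empty, and non-numeric values.
    case $depth in
        0|inf|''|*[!0-9]*)
            echo "refusing: use a finite depth with -H" >&2
            return 1
            ;;
    esac
    printf 'wget -r -l %s -H -D %s %s\n' "$depth" "$domains" "$url"
}

# cdn.example.com is an assumed extra domain for illustration.
span_hosts_cmd 1 example.com,cdn.example.com https://example.com
```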