How to Limit Recursion Depth in Wget?
When downloading websites recursively with the wget
command-line tool, it is easy to start a runaway crawl that follows
links far deeper than intended and downloads far more data than you
need. This guide shows how to limit the recursion depth to a specific
number of levels, explains the command-line flags involved, and
covers best practices for managing crawl depth.
The Short Answer: Using the -l Flag
The most direct way to control the depth of a recursive download in
wget is by using the --level option, or its
short-form counterpart, -l. By default, when you trigger a
recursive download, wget sets a maximum depth limit of 5
levels.
To specify your own limit, use the following syntax:
wget -r -l <depth_number> <URL>
For example, if you only want to download the main page and the pages directly linked from it, you would set the depth to 1:
wget -r -l 1 https://example.com
Understanding Key Parameters
To gain full control over your download, it helps to understand how these specific flags interact:
- -r (Recursive): This turns on recursive downloading, telling wget to follow links found on the target page. Without this flag, the depth limit flag has no effect.
- -l <number> (Level): This specifies the maximum depth. wget starts at depth 0 (the initial URL provided). Links found on that initial page are at depth 1, links on those pages are at depth 2, and so on.
- Infinite Depth (-l 0 or -l inf): If you truly want no depth limit, you can pass 0 or inf as the value. Use this with extreme caution, as it can quickly overwhelm your local storage and the target server.
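To make the depth numbering concrete, here is a toy model of level-by-level traversal in plain shell. It does not call wget; the link graph, the page names, and the pages_within_depth helper are all hypothetical and exist only for illustration. One difference to keep in mind: in this sketch a limit of 0 means "start page only", whereas wget treats -l 0 as unlimited.

```shell
#!/bin/sh
# Hypothetical link graph, one "page linked-page" pair per line.
links="index a
index b
a c
c d"

# Print "<depth> <page>" for every page reachable from START
# within MAX levels, visiting pages level by level.
pages_within_depth() {
    start="$1"
    max="$2"
    frontier="$start"
    seen=" $start "
    depth=0
    echo "$depth $start"
    while [ "$depth" -lt "$max" ] && [ -n "$frontier" ]; do
        depth=$((depth + 1))
        next=""
        for page in $frontier; do
            # Look up every page that $page links to.
            for child in $(echo "$links" | awk -v p="$page" '$1 == p { print $2 }'); do
                case "$seen" in
                    *" $child "*) ;;   # already visited, skip it
                    *)
                        seen="$seen$child "
                        next="$next $child"
                        echo "$depth $child"
                        ;;
                esac
            done
        done
        frontier="$next"
    done
}

# A limit of 1 reaches the start page plus the pages it links to directly:
pages_within_depth index 1
# prints:
# 0 index
# 1 a
# 1 b
```

Raising the limit to 2 would add page c, and a limit of 3 would add page d, mirroring how each extra level of -l lets the crawl follow one more hop of links.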
Practical Examples of Depth Control
Different scraping scenarios require different levels of depth. Here is how to adjust your command based on common needs:
| Depth Level | Command Example | Use Case |
|---|---|---|
| Depth 1 | wget -r -l 1 https://example.com | Grabbing a landing page and its immediate sub-pages only. |
| Depth 3 | wget -r -l 3 https://example.com | Downloading a specific multi-tiered documentation section. |
| Infinite | wget -r -l 0 https://example.com | Full website mirroring (ensure you have permission and space). |
Important Safeguards to Pair with Recursion Limits
Limiting the depth is a great first step, but a depth-limited
recursive download can still fetch far more than intended if wget is
free to climb into parent directories or follow links to other sites.
To keep your recursive download safe and efficient, consider pairing
the depth limit with the -np (no-parent)
flag. This prevents wget from traveling up to parent
directories, ensuring you only download files below the specific
directory level you targeted.
wget -r -l 2 -np https://example.com/blog/
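The idea behind -np can be sketched as a simple prefix check. This is a simplified model, not wget's actual implementation: is_below_start is a hypothetical helper, the URLs are made up, and the sketch assumes the start URL ends with a slash and ignores query strings and redirects.

```shell
#!/bin/sh
# is_below_start is a hypothetical helper that mimics the idea behind -np:
# a link is followed only if it sits at or below the starting directory.
is_below_start() {
    base="$1"
    link="$2"
    case "$link" in
        "$base"*) return 0 ;;   # same directory or deeper
        *)        return 1 ;;   # parent or sibling directory
    esac
}

start_url="https://example.com/blog/"
for link in \
    "https://example.com/blog/2024/post.html" \
    "https://example.com/about.html"
do
    if is_below_start "$start_url" "$link"; then
        echo "follow: $link"
    else
        echo "skip:   $link"
    fi
done
```

In this model the post under /blog/ is followed, while /about.html is skipped because it lives above the starting directory.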
Be equally careful with the -H (span hosts)
flag. By default, wget will not follow
links to different domain names. If you turn on spanning hosts with
-H, a strict -l depth limit becomes
essential, because every extra level can pull in pages from
many additional sites.
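One way to combine these safeguards is sketched below. The safe_mirror_cmd helper is hypothetical, and it only prints a command instead of running it. It uses wget's long-form options (--recursive, --level, --no-parent, --span-hosts) and assumes you also want --domains, the long form of -D, to restrict host spanning to an explicit allow-list. The cdn.example.com host is a made-up example.

```shell
#!/bin/sh
# safe_mirror_cmd is a hypothetical helper; it prints a wget command
# rather than executing it, so you can review the flags first.
# Usage: safe_mirror_cmd DEPTH URL [ALLOWED_DOMAINS]
safe_mirror_cmd() {
    depth="$1"
    url="$2"
    domains="${3:-}"
    # Always limit depth and stay below the starting directory.
    cmd="wget --recursive --level=$depth --no-parent"
    if [ -n "$domains" ]; then
        # Span hosts only when an explicit allow-list is supplied.
        cmd="$cmd --span-hosts --domains=$domains"
    fi
    echo "$cmd $url"
}

# Stay on one host, two levels deep:
safe_mirror_cmd 2 https://example.com/blog/

# Allow a second (assumed) host, but only one level deep:
safe_mirror_cmd 1 https://example.com/ example.com,cdn.example.com
```

Printing the command first is a deliberate choice here: it lets you confirm the depth and domain list before any download starts.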