How to Exclude Directories in Wget Recursive Download?

When downloading websites or large repositories recursively using the wget command-line tool, you often want to skip specific folders like asset libraries, logs, or private directories. This article provides a quick guide on how to use the --reject-regex, -X (--exclude-directories), and --no-parent flags to precisely control which directories are ignored during a recursive download, saving you bandwidth and storage space.

The Standard Method: Using the -X Flag

The most common way to exclude specific directories during a recursive download is by using the -X (or --exclude-directories) option. This flag accepts a comma-separated list of directories that you want wget to skip.

wget -r -X /themes,/plugins https://example.com

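In this example, -r turns on recursive retrieval, and -X /themes,/plugins tells wget to skip the /themes and /plugins directories, along with everything beneath them, while it crawls https://example.com.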

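Important Note: The directory paths passed to the -X flag are matched against the URL path from the root of the host, not relative to the URL you start from. Write each one as a full path from the root, conventionally with a leading forward slash (/), even if you are starting the download from a subfolder.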
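When the exclusion list grows, typing it by hand becomes error-prone. The following is a minimal sketch, not part of wget itself: join_excludes is a hypothetical helper, and the directory names are placeholders. It joins its arguments into the comma-separated form that -X expects and prints the resulting command for review instead of running it.

```shell
# Hypothetical helper: join all arguments with commas.
# The IFS change is confined to the subshell in parentheses.
join_excludes() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Placeholder directories, written as full paths from the host root.
exclude_list=$(join_excludes /themes /plugins /logs)

# Print the command rather than executing it, so it can be checked first.
echo "wget -r -X $exclude_list https://example.com"
# prints: wget -r -X /themes,/plugins,/logs https://example.com
```

Once the printed command looks right, it can be pasted into the terminal and run as-is.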

Excluding Directories with Wildcards

If you want to exclude directories that follow a certain naming pattern, you can use basic wildcards (like *) within the -X flag.

wget -r -X "/*/images,/*/backup*" https://example.com

When using wildcards, always enclose the directory list in quotes to prevent your local shell from expanding the asterisks before the command is sent to wget.
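The effect of quoting can be seen without running wget at all. The sketch below is an illustration only: it creates a scratch directory (the name site/images is an arbitrary placeholder) and compares what the shell hands to a program when a wildcard is left unquoted and when it is quoted.

```shell
# Create a scratch directory containing a path the wildcard would match.
tmp=$(mktemp -d)
mkdir -p "$tmp/site/images"

# Unquoted: the shell replaces the pattern with the matching local path.
unquoted=$(cd "$tmp" && echo */images)

# Quoted: the pattern is passed through unchanged.
quoted=$(cd "$tmp" && echo "*/images")

echo "unquoted: $unquoted"   # unquoted: site/images
echo "quoted:   $quoted"     # quoted:   */images

# Clean up the scratch directory.
rm -rf "$tmp"
```

Only the quoted form guarantees that wget receives the pattern itself and not a list of local file names.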

Advanced Filtering with Regular Expressions

For highly specific or complex exclusion rules, wget supports regular expressions via the --reject-regex flag. This is incredibly useful if you want to skip directories that match a specific pattern regardless of where they sit in the folder hierarchy.

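wget -r --reject-regex "/(dev|test|private)/" https://example.com

This command skips any URL whose path contains a directory named dev, test, or private, wherever it sits in the hierarchy. The pattern is matched against the complete URL, so the leading slash matters: without it, (dev|test|private)/ would also reject unrelated paths such as /latest/, which ends in test/.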
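A reject pattern can be previewed against a few sample URLs before starting a crawl. The sketch below uses grep -E as a stand-in, on the assumption that wget's default POSIX regex type treats this pattern the same way. The URLs and the variable names loose and strict are made up for illustration.

```shell
# Hypothetical sample URLs, one per line.
urls='https://example.com/docs/index.html
https://example.com/latest/notes.html
https://example.com/test/report.html
https://example.com/app/private/keys.txt'

# No leading slash: also matches "latest/", which ends in "test/".
loose='(dev|test|private)/'
# Leading slash: matches only whole directory names.
strict='/(dev|test|private)/'

echo "Matched by the loose pattern:"
printf '%s\n' "$urls" | grep -E "$loose"

echo "Matched by the strict pattern:"
printf '%s\n' "$urls" | grep -E "$strict"
```

Any URL printed under a pattern is one that wget would be expected to skip when that pattern is passed to --reject-regex. Here the loose pattern catches three URLs, including the /latest/ page, while the strict one catches only the /test/ and /private/ URLs.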

Preventing Upward Navigation with --no-parent

A common issue during recursive downloads is wget following links that lead up to a parent directory and crawling far more of the site than intended. To restrict the download to the starting directory and its subdirectories, add the -np (or --no-parent) flag to your command.

wget -r -np -X /blog/archive https://example.com/blog/

By combining -np with -X, you ensure that wget stays inside the /blog/ structure while skipping the /blog/archive folder.
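The combined effect of the two flags can be pictured as a simple decision on each URL path. The following sketch is a rough approximation for illustration, not wget's actual implementation: would_fetch is a hypothetical function, and the sample paths are invented.

```shell
# Hypothetical approximation of -np plus -X for a single path.
# $1 = starting directory, $2 = excluded directory, $3 = path to check
would_fetch() {
  start_dir=$1
  excluded=$2
  path=$3
  case "$path" in
    "$excluded"/*) return 1 ;;  # inside the excluded directory (-X)
    "$start_dir"*) return 0 ;;  # at or below the starting directory
    *)             return 1 ;;  # above or beside it (--no-parent)
  esac
}

for p in /blog/2024/post.html /blog/archive/old.html /about/team.html; do
  if would_fetch /blog/ /blog/archive "$p"; then
    echo "fetch: $p"
  else
    echo "skip:  $p"
  fi
done
```

Under these assumptions, only /blog/2024/post.html is fetched: /blog/archive/old.html falls inside the excluded directory, and /about/team.html lies outside /blog/.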