How to Exclude Directories in Wget Recursive Download?
When downloading websites or large repositories recursively using the
wget command-line tool, you often want to skip specific
folders like asset libraries, logs, or private directories. This article
provides a quick guide on how to use the --reject-regex,
-X (--exclude-directories), and
--no-parent flags to precisely control which directories
are ignored during a recursive download, saving you bandwidth and
storage space.
The Standard Method: Using the -X Flag
The most common way to exclude specific directories during a
recursive download is by using the -X (or
--exclude-directories) option. This flag accepts a
comma-separated list of directories that you want wget to
skip.
wget -r -X /themes,/plugins https://example.com
In this example:
-r enables recursive downloading.
-X /themes,/plugins instructs wget to completely ignore those specific paths on the server.
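Important Note: The directory paths passed to the
-X flag are matched against the path from the root of the host, so write them from the root with a leading forward slash (/), even if you are starting the download from a subfolder. For example, to skip an archive folder inside /blog/, pass /blog/archive rather than archive.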
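The comma-separated list can be assembled by a small script when the directories come from elsewhere. The following is a minimal sketch, not part of wget: build_exclude_list is a hypothetical helper, the directory names are illustrative, and the final line only prints the command as a dry run instead of running it.

```shell
# Hypothetical helper: turn directory names into the comma-separated
# list that -X takes. It adds a leading slash where one is missing and
# drops a single trailing slash.
build_exclude_list() {
  list=
  for dir in "$@"; do
    dir=${dir%/}              # "/plugins/" becomes "/plugins"
    case "$dir" in
      /*) ;;                  # already written from the host root
      *) dir="/$dir" ;;       # "themes" becomes "/themes"
    esac
    list="${list:+$list,}$dir"
  done
  printf '%s\n' "$list"
}

excludes=$(build_exclude_list themes /plugins/ logs)

# Dry run: print the command instead of executing it.
echo "wget -r -X $excludes https://example.com"
```

Printing the command first makes it easy to check the list before starting a long crawl.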
Excluding Directories with Wildcards
If you want to exclude directories that follow a certain naming
pattern, you can use basic wildcards (like *) within the
-X flag.
wget -r -X "/*/images,/*/backup*" https://example.com
When using wildcards, always enclose the directory list in quotes to
prevent your local shell from expanding the asterisks before wget
receives them.
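To see what quoting does, it can help to print the arguments exactly as a program would receive them. This is a small illustrative sketch: show_args is a hypothetical helper, and wget is never actually run here.

```shell
# Hypothetical helper: print each argument in brackets, one per line,
# exactly as the called program would receive it.
show_args() {
  printf '[%s]\n' "$@"
}

# Quoted: the wildcard list arrives as one literal argument.
show_args wget -r -X "/*/images,/*/backup*" https://example.com
```

With the quotes in place, the fourth line of output is the untouched pattern. Without them, the shell may replace the pattern with matching local paths if any exist, so the quoted form is the predictable one.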
Advanced Filtering with Regular Expressions
For highly specific or complex exclusion rules, wget
supports regular expressions via the --reject-regex flag.
This is incredibly useful if you want to skip directories that match a
specific pattern regardless of where they sit in the folder
hierarchy.
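wget -r --reject-regex "/(dev|test|private)/" https://example.com
The pattern is matched against each complete URL, so this command skips
any directory named dev, test, or private anywhere in the URL path
during the recursive crawl. The leading slash in the pattern keeps it
from also matching names that merely end in one of those words, such as
latest/.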
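A pattern can be previewed against a handful of URLs before starting a crawl. The sketch below uses grep -E as a rough stand-in for wget's default POSIX regex matching. The sample URLs are made up for illustration, and the pattern includes a leading slash so that a name like latest/ is not caught by the test alternative.

```shell
# Pattern to preview. grep -E is only an approximation of how
# --reject-regex evaluates it.
pattern='/(dev|test|private)/'

# Hypothetical sample URLs, for illustration only.
urls='https://example.com/docs/index.html
https://example.com/dev/notes.html
https://example.com/app/test/run.html
https://example.com/latest/readme.html
https://example.com/private/keys.html'

# Split the list into URLs the pattern matches and URLs it does not.
rejected=$(printf '%s\n' "$urls" | grep -E "$pattern")
kept=$(printf '%s\n' "$urls" | grep -Ev "$pattern")

printf 'Would be rejected:\n%s\n' "$rejected"
printf 'Would be kept:\n%s\n' "$kept"
```

If a URL you expected to keep shows up in the rejected list, tighten the pattern before running the real download.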
Preventing Upward Navigation with --no-parent
A common issue during recursive downloads is wget
following links that lead to a parent directory and pulling in far more
of the site than intended. To restrict the download strictly to the
starting directory and its subdirectories, add the
-np (or --no-parent) flag to your command.
wget -r -np -X /blog/archive https://example.com/blog/
By combining -np with -X, you ensure that
wget only moves deeper into the /blog/
structure while successfully skipping the /blog/archive
folder.
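The combined effect of the two flags can be pictured as a simple decision per URL path. The sketch below is a simplified model for illustration, not wget's actual logic: would_fetch is a hypothetical function, the paths are invented, and it assumes the crawl starts at /blog/ with /blog/archive excluded.

```shell
# Hypothetical model of the -np and -X decision for a crawl that
# starts at /blog/ and excludes /blog/archive.
would_fetch() {
  case "$1" in
    /blog/archive|/blog/archive/*) echo skip ;;   # excluded by -X
    /blog/*) echo fetch ;;                        # inside the start directory
    *) echo skip ;;                               # parent or sibling, blocked by -np
  esac
}

# Illustrative paths only.
for path in /blog/2024/post.html /blog/archive/old.html /about.html; do
  printf '%s -> %s\n' "$path" "$(would_fetch "$path")"
done
```

The order of the cases matters: the exclusion is checked first, because /blog/archive would otherwise be accepted as part of /blog/.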