How to Make wget Follow Relative Links Only?
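When mirroring a website or downloading a specific directory with wget, you usually want to restrict the download to files linked within the same structure. In recursive mode, wget stays on the starting host by default, but it follows every link on that host, including links that lead up into parent directories, which clutters your local copy. wget does have a literal --relative (-L) option that follows only links written in relative form, but it also skips files in the same directory that happen to be linked with absolute URLs, and it does not stop relative links such as ../ from climbing upward. The approach described here restricts scope instead: it combines recursive downloading with directory locking and keeps wget on the starting host.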
The Command Solution
To achieve this, use the following command structure in your terminal:
wget --recursive --no-parent --level=inf --page-requisites --adjust-extension --convert-links http://example.com/target-directory/
Key Flags Explained
- --recursive (or -r): Turns on recursive downloading, which is required for following links from the initial page.
- --no-parent (or -np): The key flag for keeping wget inside a specific path. It stops wget from ascending to the parent directory, so only links that resolve to the starting directory or its subdirectories are followed. The trailing slash on the URL matters: wget treats http://example.com/target-directory/ as a directory, but reads http://example.com/target-directory as a file named target-directory located in /.
- Host spanning (--span-hosts or -H, deliberately omitted): -H is a switch that takes no value, and host spanning is off unless you pass it. As long as you leave it out, wget does not follow absolute links that point to other host names, and the scope stays locked to the host in the starting URL.
- --level=inf (or -l inf): Sets the maximum recursion depth. The default depth is 5; inf removes the limit so that all nested files within the directory structure are retrieved. Use a specific number (e.g., -l 3) to cap the depth.
- --page-requisites (or -p): Tells wget to download the elements needed to display each page correctly, such as images and stylesheets. Without -H, requisites hosted on other servers are still skipped. On the starting host, however, requisites may be fetched even when they sit outside the starting directory, so --no-parent does not strictly apply to them.
- --adjust-extension (or -E): Appends .html to downloaded HTML files whose names lack it (and .css to stylesheets), so that they open correctly from disk.
- --convert-links (or -k): After the download completes, rewrites the links in the downloaded documents. Links to files that were downloaded become relative local links, and links to files that were not downloaded become absolute URLs pointing at the original site, so the mirror can be browsed offline.
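The same set of options can be written with short flags. The sketch below is one possible way to wrap them; mirror_cmd is a hypothetical helper name, not something wget provides. It only prints the command so that you can inspect it first, and it assumes the URL names a directory, which is why it appends a trailing slash when one is missing. Host spanning is simply left at its default (off), and -k is the short form of --convert-links.

```shell
#!/bin/sh
# mirror_cmd is a hypothetical helper, not part of wget.
# Usage: mirror_cmd URL [DEPTH]
# Prints the wget command instead of running it (a dry run).
mirror_cmd() {
    url=$1
    depth=${2:-inf}          # default to unlimited recursion depth

    # Assumption: URL names a directory. --no-parent relies on the
    # trailing slash to tell a directory from a file, so add it if missing.
    case "$url" in
        */) ;;
        *)  url="$url/" ;;
    esac

    # -r  --recursive          -np --no-parent
    # -l  --level              -p  --page-requisites
    # -E  --adjust-extension   -k  --convert-links
    echo "wget -r -np -l $depth -p -E -k $url"
}

mirror_cmd http://example.com/target-directory
mirror_cmd http://example.com/target-directory/ 3
```

Once the printed command looks right, paste it into the terminal. Quote the URL if it contains characters such as ? or &, which the shell would otherwise interpret.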
Common Pitfalls to Avoid
When attempting to isolate downloads to relative paths, watch out for the following behaviors:
- Absolute Links to the Same Domain: If a page contains a link like href="/about", wget views this as a valid link on the same host. However, if you started your download at http://example.com/portfolio/, the --no-parent flag blocks wget from downloading /about because it sits outside the /portfolio/ directory hierarchy.
- Subdomains: wget treats subdomains (like blog.example.com) as entirely different hosts. Links to a subdomain are always absolute, and they are skipped unless you enable --span-hosts, ideally together with --domains to limit which hosts are allowed (--domains on its own does not turn host spanning on). Doing so relaxes the strictness of your download scope.
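To make the same-host pitfall concrete, here is a rough model of the directory check that --no-parent performs on links to the starting host. It is a simplified sketch, not wget's actual implementation: in_scope and check are hypothetical names, only URL paths are compared, and special cases such as page requisites and query strings are ignored.

```shell
#!/bin/sh
# Simplified model of the --no-parent directory rule (illustration only).

# in_scope START_PATH LINK_PATH
# Succeeds when LINK_PATH lies inside the directory of START_PATH.
in_scope() {
    # Drop everything after the last slash, then put the slash back:
    #   /portfolio/  -> /portfolio/
    #   /portfolio   -> /
    start_dir="${1%/*}/"
    case "$2" in
        "$start_dir"*) return 0 ;;
        *)             return 1 ;;
    esac
}

# check START_PATH LINK_PATH: print the verdict for one link.
check() {
    if in_scope "$1" "$2"; then
        echo "$2: followed"
    else
        echo "$2: skipped"
    fi
}

check /portfolio/ /portfolio/2023/index.html   # inside the directory
check /portfolio/ /about                       # outside, so skipped
check /portfolio  /about                       # no trailing slash: boundary is /
```

The last call shows why the trailing slash on the starting URL matters: without it, the boundary becomes the site root and /about is in scope again.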