How Does Wget's --strict-comments Option Work in Recursive Downloads?

The --strict-comments option in wget changes how the tool parses HTML comments during recursive downloads, enforcing the SGML-based comment syntax that the HTML specification defines. By default, wget (since version 1.9) uses a relaxed, browser-like method that ends each comment at the first occurrence of -->. Activating this option makes wget treat comments as SGML declarations, in which comment text sits between pairs of -- delimiters. On pages whose comments do not follow that syntax, this changes which URLs are discovered and followed during a recursive site crawl.

Understanding Wget’s Default HTML Parsing

When you initiate a recursive download using wget -r, the tool downloads a page, parses its HTML content to find hyperlinks (like <a href="...">), and then adds those links to its download queue.
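The download, parse, and queue cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions, not wget's implementation: `LinkCollector`, `crawl`, and the in-memory `SITE` dictionary are hypothetical names, the `fetch` callable stands in for a real HTTP request, and only `<a href>` links are considered. The default depth limit of 5 mirrors wget's documented default recursion depth.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=5):
    """Breadth-first crawl; returns URLs in the order they were fetched.

    `fetch` is a hypothetical callable: URL -> HTML text, or None on failure.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        order.append(url)
        if depth >= max_depth:
            continue  # page is saved, but its links are not followed
        parser = LinkCollector()
        parser.feed(body)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return order


# A tiny in-memory "site" so the sketch runs without network access.
SITE = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="b.html">B</a>',
    "http://example.test/b.html": "<p>no links</p>",
}

print(crawl("http://example.test/", SITE.get))
# ['http://example.test/', 'http://example.test/a.html', 'http://example.test/b.html']
```

Any link the parser never sees, for example because it sits inside what the parser considers a comment, never reaches the queue, which is why comment handling affects the final set of downloaded files.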

By default, wget is highly forgiving of poorly formatted HTML. This “loose” parsing extends to how it handles HTML comments (`<!-- ... -->`). Most browsers ignore the formal comment syntax and simply end a comment at the first `-->`, so pages with malformed comments still render. `wget` mimics this tolerance: since version 1.9, its default parser terminates each comment at the first occurrence of `-->`, so stray or nested hyphens inside a comment do not cause it to stop scanning the rest of the document for downloadable links.
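As a rough illustration of the loose, browser-style rule (a comment runs from `<!--` to the first `-->`), here is a minimal sketch. It is not wget's actual code: `strip_naive_comments` and `find_links` are hypothetical helpers, links are matched with a deliberately simple regular expression, and dropping the rest of the page after an unterminated comment is an assumption of this sketch rather than documented wget behavior.

```python
import re

# Simplistic link matcher, sufficient for this illustration only.
HREF_RE = re.compile(r'href="([^"]*)"')


def strip_naive_comments(html: str) -> str:
    """Remove comments, ending each one at the first '-->' after '<!--'."""
    out = []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            out.append(html[pos:])
            break
        out.append(html[pos:start])
        end = html.find("-->", start + 4)
        if end == -1:
            # Assumption of this sketch: an unterminated comment
            # swallows the remainder of the document.
            break
        pos = end + 3
    return "".join(out)


def find_links(html: str) -> list:
    """Return href values found outside of (naively delimited) comments."""
    return HREF_RE.findall(strip_naive_comments(html))


page = '<a href="/a.html">A</a><!-- note -- still comment --><a href="/b.html">B</a>'
print(find_links(page))  # ['/a.html', '/b.html']
```

The stray `--` in the middle of the comment has no effect here, because only the literal sequence `-->` ends the comment.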

The Role of the Strict Option

When you append the --strict-comments flag to your command, you instruct wget to switch from its flexible, error-tolerant parsing mode to a strict compliance mode.

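Spec-Compliant Comment Boundaries

Under strict parsing, wget adheres to the SGML rules on which HTML comments are based. A comment is part of a declaration that begins with <! and ends with >, and the comment text itself is delimited by pairs of hyphens (--). So <!--foo--> and <!--one-- --two--> are valid comments, but <!--1--2--> is not. If a web developer used non-standard comment formatting, such as a decorative row of hyphens whose count is not a multiple of four, strict parsing treats the comment as still open until the next --, which may be far down the page. Actual page content, including important URLs, can then be classified as part of one giant comment and never scanned for links. This is how wget behaved before version 1.9, and it caused missed links on many pages that displayed fine in browsers.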
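To make the pairs-of-hyphens rule concrete, the following is a simplified sketch of strict comment scanning. It is an approximation written for illustration, not wget's parser: `strict_declaration_end`, `strip_strict`, and `find_links_strict` are hypothetical names, quoted strings inside declarations are ignored, and a comment that never closes is assumed to swallow the rest of the document.

```python
import re

# Simplistic link matcher, sufficient for this illustration only.
HREF_RE = re.compile(r'href="([^"]*)"')


def strict_declaration_end(html: str, start: int) -> int:
    """Return the index just past the declaration starting at html[start] ('<!').

    Simplified SGML model: text between a pair of '--' delimiters is a
    comment, and the declaration ends at the first '>' outside any comment.
    """
    i = start + 2
    n = len(html)
    while i < n:
        if html.startswith("--", i):
            close = html.find("--", i + 2)
            if close == -1:
                return n  # comment never closes: rest of document is comment
            i = close + 2
        elif html[i] == ">":
            return i + 1
        else:
            i += 1
    return n


def strip_strict(html: str) -> str:
    """Remove every '<!...>' declaration using the strict rule above."""
    out, pos = [], 0
    while True:
        start = html.find("<!", pos)
        if start == -1:
            out.append(html[pos:])
            return "".join(out)
        out.append(html[pos:start])
        pos = strict_declaration_end(html, start)


def find_links_strict(html: str) -> list:
    return HREF_RE.findall(strip_strict(html))


# The first comment closes with four hyphens, so under strict rules the
# last two hyphens open a second comment that hides the following markup.
page = '<!-- menu ----><a href="/b.html">B</a><!-- end --><a href="/c.html">C</a>'
print(find_links_strict(page))  # []
```

With this input the strict scan finds no links at all, because the comment never closes cleanly. A scanner that ended the first comment at the first `-->` would see both anchors.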

Preventing False Positives

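Conversely, strict mode ignores text only if it matches the formal definition of a comment declaration, and in a few cases that definition ends a comment earlier than the default parser does. For example, the formal syntax permits whitespace between the closing -- and the final >, so <!-- note -- > is a complete comment under strict rules, whereas the default parser keeps skipping text until it finds a literal -->. In such cases, which are uncommon, strict parsing prevents wget from skipping the sections of the page that follow the comment.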

Impact on Recursive Download Outcomes

Using this option can lead to two distinct outcomes depending on how the target website was coded:

Expanded Scrapes: In rare cases, wget downloads more files with the flag, because a comment that the default parser would extend to a later --> ends earlier under the strict rules, exposing the links that follow it.
Restricted Scrapes: More commonly, on pages with non-compliant comments, wget downloads fewer files, because strict parsing extends a malformed comment to the next -- and ignores every link in between. If those links were meant to be commented out, the result matches the markup as written; otherwise, valid links are missed.

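If you notice that your recursive downloads are missing entire sections of a website, or if wget seems to stop scanning a page unexpectedly, first check that --strict-comments (or the equivalent strict_comments = on setting in .wgetrc) is not enabled, since strict parsing is a known cause of missed links on pages with non-compliant comments. Enable the option only when you specifically need wget's comment handling to follow the formal specification rather than browser behavior.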