How to Force Wget to Download HTML with Different Extensions
This article provides a quick overview of how to compel the
wget command-line utility to treat a remote resource as an
HTML file, regardless of its file extension or the Content-Type header
sent by the server. You will learn the exact flags required to override
default download behaviors, how to force input parsing, and practical
examples for handling misconfigured servers or scraped data.
Understanding the Wget Extension Challenge
By default, wget relies on the URL structure and the
HTTP Content-Type header delivered by the web server to
determine how to handle and save a file. If a server serves an HTML
document disguised with a .txt, .php, or
entirely missing extension, wget may mirror that exact
naming convention or fail to parse it properly if you are attempting to
recursively download links.
To bypass this, you need to use specific command-line switches that
dictate how wget interprets the incoming data stream.
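Before choosing a flag, it can help to see which Content-Type the server
actually reports. The following is a minimal sketch, assuming a POSIX
shell with awk available; extract_content_type and content_type_of are
hypothetical helper names, not part of wget. It relies on wget's
--spider and --server-response options, which check the resource
without saving it and print the response headers.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# media type from the last Content-Type header seen, so that after
# redirects the final response wins.
extract_content_type() {
    awk 'tolower($1) == "content-type:" { sub(/;.*/, "", $2); value = $2 }
         END { print value }'
}

# Hypothetical helper: request headers only. wget writes the headers to
# stderr, so stderr is merged into the pipe.
content_type_of() {
    wget --spider --server-response "$1" 2>&1 | extract_content_type
}
```

For example, content_type_of http://example.com/page.php would print
text/html if the server labels the script's output as HTML.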
The Force HTML Option (-F or --force-html)
When you pass wget an input file with -i, it normally expects a plain
list of URLs, one per line. For a local file, wget does not look at the
extension at all: unless told otherwise, it reads the file as plain
text, so links embedded in HTML markup are not extracted. (A remote
input file is treated as HTML automatically only when the server sends
it with a text/html Content-Type.)
To force wget to treat the input file as HTML, use the
-F flag:
wget -F -i disguised_list.txt
In this scenario:
-i tells wget to read URLs from the specified file.
-F forces wget to parse disguised_list.txt as HTML, allowing it to
extract and follow links even though the file ends in .txt.
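To make the scenario concrete, here is a small sketch that creates such
a file and wraps the wget call. The file contents, the helper name
fetch_links_from, and the base URL are assumptions for illustration.
Because one link is relative, the sketch also passes --base, which wget
uses to resolve relative links found in an HTML input file.

```shell
# Create an HTML document that hides behind a .txt name (sample content).
cat > disguised_list.txt <<'EOF'
<html><body>
<a href="http://example.com/first.html">First</a>
<a href="docs/second.html">Second (relative link)</a>
</body></html>
EOF

# Hypothetical helper: $1 is the input file, $2 the base URL used to
# resolve relative links.
fetch_links_from() {
    wget --force-html --base="$2" --input-file="$1"
}
```

Calling fetch_links_from disguised_list.txt http://example.com/ should
make wget request both linked pages; without --force-html it would try
to read each line of the file as a URL.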
Adjusting the Saved File Extension (--adjust-extension)
If your goal is to download a page that is dynamically generated
(like a .php or .cgi script) but actually
outputs pure HTML, you want the local file to be saved with a proper
.html extension so it opens correctly in a browser.
You can achieve this using the --adjust-extension flag
(short form -E; called --html-extension before wget 1.12):
wget --adjust-extension http://example.com/page.php
If the server response headers indicate that the file is
text/html and the URL does not already end in .htm or .html, wget
appends .html to the local file name, resulting in a
file named page.php.html.
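The naming rule can be approximated in a few lines of shell. This is a
rough sketch for predicting the local file name, not wget's actual
implementation; the function name adjusted_name is hypothetical, and it
assumes the server reports the content as text/html.

```shell
# Hypothetical helper: predict the local name wget --adjust-extension
# would use for an HTML response, given the name it would otherwise save.
adjusted_name() {
    case "$1" in
        # Already ends in .htm or .html (any letter case): leave unchanged.
        *.[Hh][Tt][Mm] | *.[Hh][Tt][Mm][Ll]) printf '%s\n' "$1" ;;
        # Anything else gets .html appended.
        *) printf '%s.html\n' "$1" ;;
    esac
}

adjusted_name page.php     # prints page.php.html
adjusted_name index.html   # prints index.html
```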
Forcing a Specific Output File Name (-O)
If the remote server is misconfigured, sending the wrong content
type, and using an incorrect extension, the most reliable manual
override is the output document option (-O). This allows
you to explicitly name the file and dictate its extension locally,
regardless of what the server intended.
wget -O downloaded_page.html http://example.com/weird-extension.dat
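Since -O saves whatever the server sends under the name you chose, a
quick sanity check after the download can confirm that the content
really is HTML. The sketch below is one possible check, assuming a shell
with head and grep; looks_like_html and fetch_as_html are hypothetical
helpers. The check only searches the first kilobyte for an HTML doctype
or an opening html tag, so it is a heuristic rather than a validator.

```shell
# Hypothetical helper: succeed if the start of the file looks like HTML.
looks_like_html() {
    head -c 1024 "$1" | grep -qiE '<!doctype html|<html'
}

# Hypothetical wrapper: download URL $1 to local name $2, then check it.
fetch_as_html() {
    wget -O "$2" "$1" && looks_like_html "$2"
}
```

fetch_as_html http://example.com/weird-extension.dat downloaded_page.html
returns a non-zero status if the download fails or the saved file does
not look like HTML.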