How to Force Wget to Download HTML with Different Extensions

This article shows how to make the wget command-line utility treat a resource as HTML when its file extension, or the Content-Type header sent by the server, says otherwise. It covers the flags that force HTML parsing of input files, adjust the extension of saved files, and override the output file name, with practical examples for misconfigured servers or scraped data.

Understanding the Wget Extension Challenge

By default, wget derives the local file name from the URL and uses the HTTP Content-Type header delivered by the web server to decide whether a downloaded document is HTML that should be scanned for links during a recursive download. If a server serves an HTML document under a .txt or .php name, or with no extension at all, wget saves it under that same name. If the server also labels the document with the wrong Content-Type, a recursive download will not follow the links inside it.

To bypass this, you need to use specific command-line switches that dictate how wget interprets the incoming data stream.
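Before choosing a switch, it can help to see what the server actually reports. The sketch below is one way to do that, assuming a POSIX shell: it defines a small helper (the name extract_content_type is made up for this example) that pulls the Content-Type value out of the headers wget prints with --server-response. The URL in the usage comment is a placeholder.

```shell
# extract_content_type: read "wget --server-response" output on stdin and
# print the value of the last Content-Type header (the last one matters
# when the request was redirected).
extract_content_type() {
    grep -i '^ *content-type:' | tail -n 1 | sed 's/^[^:]*:[[:space:]]*//'
}

# Usage (needs network access; the URL is a placeholder).
# --spider checks the resource without saving it; wget prints the headers
# on stderr, hence the 2>&1.
#   wget --spider --server-response "http://example.com/page.php" 2>&1 \
#     | extract_content_type
```

If the value printed is not text/html even though the document is HTML, --adjust-extension will not rename the file, and an explicit -O is likely the simpler fix.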

The Force HTML Option (-F or --force-html)

When you pass wget an input file with -i, it expects a plain list of URLs, one per line. A local file is read that way whatever its extension, so an HTML file given to -i is not scanned for links. (A remote input file is treated as HTML automatically when the server sends it with a text/html Content-Type.)

To force wget to treat the input file as HTML, use the -F flag:

wget -F -i disguised_list.txt

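In this scenario, -i names the file to read and -F tells wget to parse that file as HTML and extract the links it contains, rather than read it as a plain list of URLs.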
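A saved HTML file often contains relative links, which wget cannot resolve on its own. The following sketch pairs --force-html with --base for that case. The file name saved_page.dat, the links inside it, and the base URL are all placeholders chosen for illustration.

```shell
# Create a small HTML file whose name gives no hint that it is HTML.
cat > saved_page.dat <<'EOF'
<html><body>
<a href="docs/intro.html">Intro</a>
<a href="http://example.com/about.html">About</a>
</body></html>
EOF

# --force-html: parse the input file as HTML instead of a plain URL list.
# --base: resolve relative links such as docs/intro.html against this URL.
# The leading "echo" only prints the command; remove it to run the download.
echo wget --force-html --base=http://example.com/ -i saved_page.dat
```

With the echo removed, wget would request both http://example.com/docs/intro.html and http://example.com/about.html.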

Adjusting the Saved File Extension (--adjust-extension)

If your goal is to download a page that is dynamically generated (like a .php or .cgi script) but actually outputs pure HTML, you want the local file to be saved with a proper .html extension so it opens correctly in a browser.

You can achieve this using the --adjust-extension flag (short form -E; the long option was named --html-extension before wget 1.12):

wget --adjust-extension http://example.com/page.php

If the server response headers indicate that the file is text/html (or application/xhtml+xml) and the URL does not already end in .html or .htm, wget appends .html to the local file name. It does not replace the existing extension, so the example above is saved as page.php.html.
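The naming rule can be illustrated with a few lines of shell. This is only a sketch of the rule as the wget manual describes it for text/html responses, not wget's own code, and the function name adjusted_name is made up for the example.

```shell
# adjusted_name: print the local name --adjust-extension would be expected
# to produce for an HTML response, given the name taken from the URL.
adjusted_name() {
    case "$1" in
        # Already ends in .htm or .html (any letter case): keep as-is.
        *.[Hh][Tt][Mm] | *.[Hh][Tt][Mm][Ll]) printf '%s\n' "$1" ;;
        # Anything else: append .html after the existing name.
        *) printf '%s.html\n' "$1" ;;
    esac
}

adjusted_name page.php     # prints page.php.html
adjusted_name index.html   # prints index.html
adjusted_name report.htm   # prints report.htm
```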

Forcing a Specific Output File Name (-O)

If the remote server is misconfigured, sending the wrong content type and using a misleading extension, the most reliable manual override is the output document option (-O). It lets you name the local file, extension included, regardless of what the server sent. Note that -O works like shell redirection: everything wget downloads is written to that one file, so use it with a single URL.

wget -O downloaded_page.html http://example.com/weird-extension.dat
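Because -O accepts whatever the server sends, it may be worth checking that the saved file really is HTML. The helper below is a rough heuristic, not a validator: it only looks for an HTML marker near the top of the file. The function name looks_like_html is made up for this example, and the URL in the usage comment is a placeholder.

```shell
# looks_like_html: succeed if the first 20 lines of a file contain
# an HTML doctype or an opening <html tag (case-insensitive).
looks_like_html() {
    head -n 20 "$1" | grep -qiE '<!doctype html|<html'
}

# Usage (needs network access; URL and file name are placeholders):
#   wget -O downloaded_page.html "http://example.com/weird-extension.dat" \
#     && looks_like_html downloaded_page.html \
#     && echo "saved file looks like HTML"
```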