Here’s Why wget --convert-links Isn’t Converting Your Links

Daniel Malmer
Jun 13, 2020

Hopefully this will save someone the frustration that I just went through trying to get wget --mirror --convert-links to function properly. TL;DR: it will convert your links, but only once it has finished downloading the entire site.

wget is one of the common tools used to mirror websites. In my case, I wanted to mirror my website to my laptop, and then browse the content directly from the file system instead of using a web server.

There’s a small problem with browsing web content from a file system, which is that links to assets like images, JavaScript, and CSS will probably be broken. For example, imagine the following appears in an HTML file:

<link rel="stylesheet" type="text/css" href="/css/main.css" />

Your browser won’t be able to find it. When the page is opened from the file system, /css/main.css resolves against the root of the file system, whereas wget saves assets in a directory alongside the mirrored pages.
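To make the problem concrete, here is a small sketch of how the two forms of the link resolve once the page is opened from disk. The mirror location /home/me/mirror/www.example.com is a made-up path used only for illustration, and Python's urljoin stands in for the browser's URL resolution.

```python
from urllib.parse import urljoin

# Hypothetical location of a mirrored page on the local file system.
page = "file:///home/me/mirror/www.example.com/index.html"

# A link with a leading slash resolves against the root of the file system,
# not against the directory that holds the mirror.
absolute_target = urljoin(page, "/css/main.css")

# A relative link resolves against the directory that contains the page.
relative_target = urljoin(page, "css/main.css")

print(absolute_target)  # file:///css/main.css
print(relative_target)  # file:///home/me/mirror/www.example.com/css/main.css
```

Only the second form points at a file that wget actually wrote.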

To deal with this, wget has an option, --convert-links, which is supposed to convert asset links so that they’re suitable for local viewing. From GNU’s documentation:

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

In fact, if you download a single file, like this:

wget --convert-links https://www.example.com/index.html

it will correctly change the above stylesheet reference to:

<link rel="stylesheet" type="text/css" href="css/main.css" />

Note that the leading directory separator has been removed, changing the path from absolute to relative. This allows your web browser to find the assets in the directory structure that wget created while mirroring your site.

Although using wget with a single file works fine, mirroring a large site is a different story.

GNU’s documentation on wget gives this as an example of mirroring a site for local viewing:

wget --mirror --convert-links --backup-converted  \
https://www.gnu.org/ -o /home/me/weeklog

If you try this on a large site and check the downloaded HTML files while the mirror is still running, you’ll see that the asset URLs haven’t been changed.

The confusion has to do with some ambiguity in GNU’s documentation:

After the download is complete, convert the links in the document to make them suitable for local viewing.

I read that as “after the download of the file is complete.” What it actually means is “after the download of the site is complete.” For large sites, this can take hours or days, and you just have to trust that the links will be converted at the end.
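If you'd rather check than trust, a rough sketch like the following can be run against the mirror directory once wget has exited. The function name find_unconverted is my own, and the regular expression is only a heuristic: it looks for href or src attributes that still start with a single slash. It skips the .orig backups that --backup-converted leaves behind, since those keep the original links on purpose.

```python
import re
from pathlib import Path

# Matches href="/..." or src="/..." but not protocol-relative "//host/..." links.
ROOT_RELATIVE = re.compile(r'''(?:href|src)\s*=\s*["']/(?!/)''', re.IGNORECASE)


def find_unconverted(mirror_dir):
    """Return the HTML files under mirror_dir that still contain root-relative links."""
    hits = []
    for path in sorted(Path(mirror_dir).rglob("*")):
        # Only look at .html/.htm files; this skips index.html.orig backups.
        if not path.is_file() or path.suffix.lower() not in {".html", ".htm"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if ROOT_RELATIVE.search(text):
            hits.append(path)
    return hits
```

Calling find_unconverted("www.example.com") while wget is still running will most likely list every page. An empty list after wget has finished suggests the conversion pass ran.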

I confirmed this by inspecting the source code, which is available at https://ftp.gnu.org/gnu/wget/.

This is the comment for the convert_all_links() function in convert.c:

/* This function is called when the retrieval is done to convert the
links that have been downloaded. It has to be called at the end of
the retrieval, because only then does Wget know conclusively which
URLs have been downloaded, and which not, so it can tell which
direction to convert to...

And we can confirm in main.c that it’s called immediately before exiting:

  if ((opt.convert_links || opt.convert_file_only) && !opt.delete_after)
    convert_all_links ();

  cleanup ();

  exit (get_exit_status ());
}
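The “direction” in that comment matches what the GNU manual says about --convert-links: links to files that wget downloaded are rewritten as relative paths, while links to files it did not download are rewritten as full URLs that include the host name. Here is a toy sketch of that decision. It is not wget’s code; convert_link and the downloaded set are names I made up for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def convert_link(page_url, link, downloaded):
    """Pick a conversion direction for one link, given the final set of downloaded URLs."""
    target = urljoin(page_url, link)
    if target in downloaded:
        # The target exists locally: point at it relative to the page's directory.
        page_dir = posixpath.dirname(urlsplit(page_url).path)
        return posixpath.relpath(urlsplit(target).path, start=page_dir)
    # The target was never fetched: fall back to the full remote URL.
    return target


downloaded = {
    "https://www.example.com/index.html",
    "https://www.example.com/css/main.css",
}
print(convert_link("https://www.example.com/index.html", "/css/main.css", downloaded))
# css/main.css
print(convert_link("https://www.example.com/index.html", "/img/logo.png", downloaded))
# https://www.example.com/img/logo.png
```

The answer for any given link depends on the complete downloaded set, and that set is only final once the crawl has ended.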

So, there you have it. If you’re patient, wget really will convert your links. Hopefully.

For what it’s worth, there’s a similar program called httrack that many people use as an alternative to wget, in part because it converts links without waiting for the end. The authors of wget seem to believe that you must wait until the end, so there may be some cases that httrack doesn’t handle correctly.

Happy mirroring!
