wget --convert-links Isn’t Converting Your Links
Hopefully this will save someone the frustration that I just went through trying to get wget --mirror --convert-links to function properly. TL;DR: it will, once it's finished with the entire site.
wget is one of the common tools used to mirror websites. In my case, I wanted to mirror my website to my laptop, and then browse the content directly from the file system instead of using a web server.
The problem is that pages often reference their assets with absolute paths, like this stylesheet link:

<link rel="stylesheet" type="text/css" href="/css/main.css" />
Your browser won't be able to find it, since wget downloads assets into a directory relative to the mirrored content.
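You can see the problem with a toy version of the directory structure a mirror run creates (the site name and files here are made up for illustration):

```shell
# Fake a mirror of www.example.com with one page and one stylesheet.
mkdir -p www.example.com/css
printf '<link rel="stylesheet" type="text/css" href="/css/main.css" />\n' \
  > www.example.com/index.html
printf 'body { margin: 0; }\n' > www.example.com/css/main.css

# Opened from the file system, href="/css/main.css" resolves against
# the filesystem root (/css/main.css), which does not exist.
# The stylesheet actually lives here:
ls www.example.com/css/main.css
```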
In order to deal with this, wget has an option, --convert-links, which is supposed to convert asset links so that they're suitable for local viewing. From GNU's documentation:
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
In fact, if you download a single file, like this:
wget --convert-links https://www.example.com/index.html
it will correctly change the above stylesheet reference to:
<link rel="stylesheet" type="text/css" href="css/main.css" />
Note that the leading directory separator is missing, changing the path from absolute to relative. This allows your web browser to find the assets in the directory structure that wget created while mirroring your site.
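The rewrite is also depth-aware: a page nested deeper in the site needs a longer relative prefix. A quick sketch of the idea using GNU coreutils' realpath (the directory names are invented for illustration):

```shell
mkdir -p site/blog site/css
touch site/css/main.css

# From site/index.html the stylesheet is reachable as css/main.css ...
realpath --relative-to=site site/css/main.css
# ... but from site/blog/post.html it would be ../css/main.css,
# which is the kind of path wget writes for nested pages.
realpath --relative-to=site/blog site/css/main.css
```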
While wget works fine with a single file, mirroring a large site is a different story.
GNU's documentation on wget gives this as an example of mirroring a site for local viewing:
wget --mirror --convert-links --backup-converted \
https://www.gnu.org/ -o /home/me/weeklog
If you try this on a large site and check the downloaded HTML files while the mirror is still running to make sure it's functioning properly, you'll see that the asset URLs haven't been changed.
The confusion has to do with some ambiguity in GNU’s documentation:
After the download is complete, convert the links in the document to make them suitable for local viewing.
I read that as “after the download of the file is complete.” What it actually means is “after the download of the site is complete.” For large sites, this can take hours or days, and you just have to trust that the links will be converted at the end.
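If you want evidence rather than trust, you can count root-relative URLs in the mirror directory while wget runs; the count should stay high until the final conversion pass, then drop toward zero. This is a rough heuristic (the directory name and the grep pattern are assumptions about how your site writes links):

```shell
# Stand-in for a partially mirrored site with one unconverted link.
mkdir -p www.example.com
printf '<link rel="stylesheet" href="/css/main.css" />\n' \
  > www.example.com/index.html

# Count hrefs that still start with "/", i.e. not yet converted.
grep -rho 'href="/[^"]*"' www.example.com | wc -l
```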
I confirmed this by inspecting the wget source code. This is the comment on the convert_all_links() function:
/* This function is called when the retrieval is done to convert the
links that have been downloaded. It has to be called at the end of
the retrieval, because only then does Wget know conclusively which
URLs have been downloaded, and which not, so it can tell which
direction to convert to...
And we can confirm in main.c that it's called immediately before exiting:
if ((opt.convert_links || opt.convert_file_only) && !opt.delete_after)
  convert_all_links ();
...
exit (get_exit_status ());
So, there you have it. If you're patient, wget really will convert your links. Hopefully.
For what it's worth, there's a similar program called httrack that many people use as an alternative to wget, in part because it converts links without waiting for the end. The authors of wget seem to believe that you must wait until the end, so there may be some cases that httrack doesn't handle correctly.