
Using wget To Mirror Websites That Require Authentication

This article explains how to use wget to mirror a website that requires authentication by saving the necessary login cookies.

Before we get started, if you’re mirroring a site that doesn’t belong to you, always keep the following in mind, whether you’re using wget or another tool:

  1. Always obey the site’s robots.txt file. If they have one, it will be located at www.example.com/robots.txt. wget will respect robots.txt automatically.
  2. Read, and adhere to, the site’s terms and conditions. They may forbid crawling.
  3. Be respectful. If you’re using wget, use the --wait option to specify a delay between requests so you’re not hammering their site.
  4. Don’t republish or sell their content. It’s probably a copyright violation.
  5. Don’t crawl parts of the site that you don’t need, especially if they contain large files. If you’re using wget, you can use --exclude-directories=/big-files to skip everything under the /big-files directory.

Most website authentication works more or less the same way. Users enter their username and password into a login form’s text fields and click a “Login” button. The button press triggers an HTTP POST request to the server with the username and password as data. If the credentials are correct, the server’s response contains a Set-Cookie header with an authentication cookie to be presented in subsequent requests to the site.

You can see this in action by using Firefox’s Web Developer tools (Firefox > Tools > Web Developer > Network) or Chrome’s DevTools (View > Developer > Developer Tools). When you view the network traffic, you should be able to see the Set-Cookie header in the response headers when you log in, and a Cookie header in the request headers of subsequent requests to the server.

First, you need to find the URL to post your credentials to. You can do this by viewing the source of the login page and looking for the login form’s action, and the names of the username and password fields. For example, if the login form for example.com looks like this:

<form action="/user/login" method="post">
<input type="text" placeholder="Username" name="username">
<input type="password" placeholder="Password" name="password">
<button type="submit">Login</button>
</form>

Then the endpoint you’ll post to is https://www.example.com/user/login, and the names of the username and password form fields are (unsurprisingly) username and password.
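If you prefer to stay in the terminal, you can also pull the form action and field names out of a saved copy of the login page. This is a rough convenience sketch; the grep patterns and the extract_form_fields and login.html names are illustrative, not standard tools:

```shell
# Print the form action and the input field names from a saved HTML page.
# Assumes the page was saved locally first, e.g. with:
#   wget -O login.html https://www.example.com/login
extract_form_fields() {
  grep -oE '<form[^>]*action="[^"]*"|name="[^"]*"' "$1"
}

# Usage: extract_form_fields login.html
```

This only handles simple, one-line-per-attribute HTML like the example above; for anything messier, viewing the page source directly is more reliable.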

If your username is me@example.com and your password is password123, the wget command to log in would be:

wget --save-cookies cookies.txt --keep-session-cookies \
--post-data="username=me@example.com&password=password123" \
https://www.example.com/user/login

The --keep-session-cookies option matters because many sites issue session cookies (cookies without an expiry time), which wget would otherwise omit from cookies.txt.

You can also save the values in a file and use the --post-file option instead:

echo "username=me@example.com&password=password123" > login.txt
wget --save-cookies cookies.txt --post-file=login.txt \
https://www.example.com/user/login

Warning: in both cases, the username and password values must be percent-encoded. For example, the password pas$w/rd would appear as pas%24w%2Frd.
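If Python 3 is available on your system, one convenient way to percent-encode a value from the shell is the following sketch (the urlencode function name is just an illustration, not a standard tool):

```shell
# Percent-encode a string for use in --post-data or --post-file.
# Relies on python3 being installed; safe="" ensures "/" is encoded too.
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

urlencode 'pas$w/rd'   # prints pas%24w%2Frd
```

Note the single quotes around the password in the usage line: without them, the shell would try to expand $w as a variable.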

If your login was successful, your cookies.txt file should contain the cookies that will be used for subsequent GET requests. It should look something like this:

cat cookies.txt
www.example.com  FALSE  /  FALSE  159325864  SESSIONID  99a2839f283b47

There may be additional cookies in there as well. You don’t really need to know what these values mean, as wget will take care of that for you; the file uses the Netscape cookie file format, if you’re interested in looking it up. Your values will all be different, including the cookie name.
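If you want to peek at which cookies were saved, a small awk sketch can list them (the field positions follow the Netscape cookie file format; the list_cookies helper name is illustrative):

```shell
# Print name=value for each cookie in a Netscape-format cookie file.
# Fields are tab-separated: domain, flag, path, secure, expiry, name, value.
# Comment lines start with "#" and are skipped.
list_cookies() {
  awk '!/^#/ && NF >= 7 { print $6 "=" $7 }' "$1"
}

# Usage: list_cookies cookies.txt
```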

Now that you have your authentication cookie(s), you should be able to use wget to retrieve password-protected pages:

wget --load-cookies cookies.txt https://www.example.com/password-protected.html

You should now be able to mirror the entire site, including pages requiring authentication, like this:

wget --mirror --load-cookies cookies.txt https://www.example.com/

The cookies will eventually expire, so if you want to mirror that site again, you might have to repeat the login step.
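Since the fifth field of each cookies.txt entry is the expiry time as a Unix timestamp, you can sketch a quick local check before re-mirroring. This is a convenience helper built on that assumption, not a wget feature, and the cookies_valid name is illustrative:

```shell
# Succeed (exit 0) if no cookie in the given file has expired.
# An expiry of 0 typically marks a session cookie, so those are skipped.
cookies_valid() {
  awk -v now="$(date +%s)" '
    !/^#/ && NF >= 7 && $5 + 0 > 0 && $5 + 0 < now + 0 { expired = 1 }
    END { exit expired }' "$1"
}

# Usage:
# cookies_valid cookies.txt || echo "cookies expired; repeat the login step"
```

Keep in mind this only checks the recorded expiry times; the server may have invalidated your session earlier for other reasons.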

One gotcha that I write about elsewhere is that wget can log you out by crawling certain links. For example, loading example.com/user/logout may end your session. wget will crawl that link like any other, and the cookies that you set when you logged in will be wiped out.

The actual URL used to logout will differ by site. You can discover the URL by either inspecting the source of a page that has a logout link, or by hovering over any logout link.

One way to handle this is to prevent wget from trying to crawl that page by using the --exclude-directories option:

wget --mirror --exclude-directories=/user --load-cookies cookies.txt \
https://www.example.com

This will prevent wget from crawling any links under /user, including the logout link.

This technique should work for most sites, but not all. If the logout URL doesn’t sit under a directory you can exclude, wget’s --reject-regex option can skip any URL matching a pattern instead.

There are ways that you can get logged out from the site, in addition to crawling the logout link. For example, your session cookie may expire, or the website might terminate your session because you’ve exceeded a rate limit, or some other reason.

There are also other options for wget that come in handy when mirroring sites, such as --convert-links, --page-requisites, and --adjust-extension. These are outside the scope of this article, and will be dealt with elsewhere.

Software developer, researcher on online hate speech, extremism, and radicalization. https://www.malmer.com/