Using wget To Mirror Websites That Require Authentication
This article explains how to use wget to mirror a website that requires authentication by saving the necessary login cookies.
Before we get started, if you’re mirroring a site that doesn’t belong to you, always keep the following in mind, whether you’re using wget or another tool:
- Always obey the site’s robots.txt file. If they have one, it will be located at www.example.com/robots.txt. wget will respect robots.txt automatically.
- Read, and adhere to, the site’s terms and conditions. They may forbid crawling.
- Be respectful. If you’re using wget, use the --wait option to specify a delay between requests so you’re not hammering their site.
- Don’t republish or sell their content. It’s probably a copyright violation.
- Don’t crawl parts of that site that you don’t need, especially if they contain large files. If you’re using wget, you can use --exclude-directories=/big-files to exclude files in the /big-files directory.
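The politeness options above can be combined into a single invocation. The following is a minimal sketch rather than a recommendation for any particular site: the two-second delay, the 200k rate cap, and the /big-files path are illustrative assumptions, --random-wait and --limit-rate are additional wget options not discussed above, and the WGET override is a hypothetical convenience for previewing the command.

```shell
polite_mirror() {
  # $1 = start URL.
  # WGET can be overridden (for example WGET=echo) to preview the arguments
  # without sending any requests. All values below are illustrative.
  "${WGET:-wget}" --mirror --wait=2 --random-wait --limit-rate=200k \
    --exclude-directories=/big-files "$1"
}

# Preview only; nothing is downloaded because wget is replaced by echo.
WGET=echo polite_mirror https://www.example.com/
```

Calling polite_mirror without the WGET=echo prefix would start the actual crawl.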
How Website Authentication Works
Most website authentication works more or less the same way. Users enter their username and password into a login form’s text fields and click a “Login” button. The button press triggers an HTTP POST request to the server with the username and password as data. If the username and password are correct, the server’s response contains a Set-Cookie header that includes an authentication cookie to be presented in subsequent requests.
You can see this in action by using Firefox’s Web Developer tool (Firefox > Tools > Web Developer > Network) or Chrome’s DevTools (View > Developer > Developer Tools). When you view the network traffic, you should be able to see the Set-Cookie header in the response headers when you log in, and a Cookie header in the request headers of subsequent requests to the server.
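The same headers can also be inspected from the command line. As a rough sketch, wget’s --server-response option prints the response headers (to standard error), and a small filter can keep only the Set-Cookie lines. The helper name show_set_cookie and the sample headers are made up for illustration.

```shell
show_set_cookie() {
  # Reads a header dump on stdin and keeps only the Set-Cookie lines.
  # wget indents the headers it prints, so leading whitespace is allowed.
  grep -i '^[[:space:]]*set-cookie:'
}

# Demonstration on made-up headers:
printf '  HTTP/1.1 200 OK\n  Set-Cookie: SESSIONID=abc; Path=/\n  Content-Type: text/html\n' |
  show_set_cookie

# Against a live site (sends a real request, so it is left commented out):
# wget --server-response --spider https://www.example.com/ 2>&1 | show_set_cookie
```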
Using wget To Log In
First, you need to find the URL to post your credentials to. You can do this by viewing the source of the login page and looking for the login form’s action, and the names of the username and password fields. For example, if the login form for example.com looks like this:
<form action="/user/login" method="post">
<input type="text" placeholder="Username" name="username">
<input type="password" placeholder="Password" name="password">
<button type="submit">Login</button>
</form>
Then the endpoint you’ll post to is https://www.example.com/user/login, and the names of the username and password form fields are (unsurprisingly) username and password.
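If you would rather not read the page source by eye, the same details can be pulled out with grep. This is only a sketch: it assumes attributes are written with double quotes, as in the form above, and the helper name form_details and the login page URL in the final comment are hypothetical.

```shell
form_details() {
  # Reads HTML on stdin and prints every action="..." and name="..." attribute.
  grep -oE '(action|name)="[^"]*"'
}

# Demonstration on the example form from this article:
form_details <<'EOF'
<form action="/user/login" method="post">
<input type="text" placeholder="Username" name="username">
<input type="password" placeholder="Password" name="password">
<button type="submit">Login</button>
</form>
EOF

# Against a live page (hypothetical URL, sends a real request):
# wget -q -O - https://www.example.com/login | form_details
```

On a real page the output will usually include attributes from other forms and inputs too, so treat it as a starting point for finding the right fields.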
If your username is me@example.com, and your password is password123, the wget command to log in would be:
wget --save-cookies cookies.txt \
--post-data="username=me@example.com&password=password123" \
https://www.example.com/user/login
You can also save the values in a file and use the --post-file option, instead:
printf '%s' 'username=me@example.com&password=password123' > login.txt
wget --save-cookies cookies.txt --post-file=login.txt \
https://www.example.com/user/login
Warning: in both cases, the values for both username and password have to be percent-encoded. For example, the password pas$w/rd would appear as pas%24w%2Frd.
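Percent-encoding by hand is error-prone, so here is a minimal sketch of doing it in the shell. It assumes ASCII-only input and leaves only letters, digits, and the characters . ~ _ - unencoded. The function names urlencode and build_post_data are made up for this example, and the field names come from the example form. Note that under this rule the @ in the example username is encoded as %40.

```shell
urlencode() {
  # Percent-encode $1 one character at a time (ASCII input assumed).
  local s="$1" out="" c rest
  while [ -n "$s" ]; do
    rest="${s#?}"        # everything after the first character
    c="${s%"$rest"}"     # the first character itself
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;   # e.g. $ becomes %24
    esac
    s="$rest"
  done
  printf '%s\n' "$out"
}

build_post_data() {
  # $1 = username, $2 = password
  printf 'username=%s&password=%s\n' "$(urlencode "$1")" "$(urlencode "$2")"
}

build_post_data 'me@example.com' 'pas$w/rd'
# prints: username=me%40example.com&password=pas%24w%2Frd
```

The resulting string can be passed to --post-data or written to the file used with --post-file.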
If your login was successful, your cookies.txt file should contain the cookies that will be used for subsequent GET requests. It should look something like this:
cat cookies.txt
.example.com FALSE / FALSE 159325864 SESSIONID 99a2839f283b47
There may be additional cookies in there, as well. You don’t really need to know what these values mean, as wget will take care of that for you. The file uses the Netscape cookie file format, which you can read more about if you’re interested. Your values will all be different, including the cookie name.
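If you are curious what was saved, the file is easy to inspect. In the Netscape cookie file format each cookie is one line of seven tab-separated fields: domain, subdomain flag, path, secure flag, expiry time (epoch seconds), name, and value. The sketch below prints the name, value, domain, and expiry of each cookie. The function name list_cookies and the sample file are illustrative only.

```shell
list_cookies() {
  # $1 = cookie file in Netscape format (defaults to cookies.txt).
  # Comment lines and lines without exactly seven fields are skipped.
  awk -F '\t' '!/^#/ && NF == 7 {
    printf "%s=%s (domain %s, expires %s)\n", $6, $7, $1, $5
  }' "${1:-cookies.txt}"
}

# Demonstration on a throwaway file with one made-up cookie:
sample=$(mktemp)
printf '# Netscape HTTP Cookie File\n' > "$sample"
printf '.example.com\tTRUE\t/\tFALSE\t159325864\tSESSIONID\t99a2839f283b47\n' >> "$sample"
list_cookies "$sample"
rm -f "$sample"
```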
Presenting Your Authentication Cookies
Now that you have your authentication cookie(s), you should be able to use wget to retrieve password-protected pages:
wget --load-cookies cookies.txt https://www.example.com/password-protected.html
You should now be able to mirror the entire site, including pages requiring authentication, like this:
wget --mirror --load-cookies cookies.txt https://www.example.com/
The cookies will eventually expire, so if you want to mirror that site again, you might have to repeat the login step.
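One way to decide whether the login step needs repeating is to compare the expiry times in cookies.txt against the current time. The following is a sketch under several assumptions: the file is in the tab-separated Netscape format, a cookie with an expiry of 0 or in the past counts as expired, and the server has not ended the session early (which this check cannot detect). The function name cookies_still_valid is made up for the example.

```shell
cookies_still_valid() {
  # $1 = cookie file, $2 = current time in epoch seconds (defaults to now).
  # Succeeds if at least one cookie has an expiry time in the future.
  awk -F '\t' -v now="${2:-$(date +%s)}" '
    !/^#/ && NF == 7 && $5 + 0 > now + 0 { live = 1 }
    END { exit live ? 0 : 1 }
  ' "$1"
}

# Possible use before mirroring again (sends real requests, so commented out):
# cookies_still_valid cookies.txt ||
#   wget --save-cookies cookies.txt --post-file=login.txt https://www.example.com/user/login
```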
Don’t Let Yourself Get Logged Out
One gotcha that I write about elsewhere is that wget can log you out by crawling certain links. For example, loading example.com/user/logout may end your session. wget will crawl that link like any other, and the cookies that you set when you logged in will be wiped out.
The actual URL used to log out will differ by site. You can discover the URL by either inspecting the source of a page that has a logout link, or by hovering over any logout link.
One way to handle this is to prevent wget from trying to crawl that page by using the --exclude-directories option:
wget --mirror --exclude-directories=/user --load-cookies cookies.txt \
https://www.example.com
This will prevent wget from crawling any links under /user, including the logout link.
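Excluding all of /user also skips any pages there that you might want. A narrower alternative, assuming a wget version that supports --reject-regex (1.14 or later), is to reject only the logout URL. The sketch below uses grep’s extended regular expressions to approximate how the pattern would match. The pattern itself and the helper name would_skip are assumptions for illustration, and the real logout path will differ by site.

```shell
LOGOUT_REGEX='/user/logout'   # assumed logout path; adjust for your site

would_skip() {
  # $1 = URL. Succeeds if the URL matches the reject pattern.
  printf '%s\n' "$1" | grep -Eq "$LOGOUT_REGEX"
}

would_skip 'https://www.example.com/user/logout' && echo 'logout link is skipped'
would_skip 'https://www.example.com/user/profile' || echo 'profile page is still crawled'

# The corresponding mirror command (sends real requests, so commented out):
# wget --mirror --reject-regex "$LOGOUT_REGEX" --load-cookies cookies.txt \
#   https://www.example.com/
```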
Final Thoughts
This technique should work for most sites, but not all.
There are ways that you can get logged out from the site, in addition to crawling the logout link. For example, your session cookie may expire, or the website might terminate your session because you’ve exceeded a rate limit, or some other reason.
There are also other options for wget that come in handy when mirroring sites, such as --convert-links, --page-requisites, and --adjust-extension. These are outside the scope of this article, and will be dealt with elsewhere.