wget To Mirror Websites That Require Authentication
This article explains how to use
wget to mirror a website that requires authentication by saving the necessary login cookies.
Before we get started, if you’re mirroring a site that doesn’t belong to you, always keep the following in mind, whether you’re using
wget or another tool:
- Always obey the site’s robots.txt file. If they have one, it will be located at the root of the site, as /robots.txt.
- Read, and adhere to, the site’s terms and conditions. They may forbid crawling.
- Be respectful. If you’re using wget, use the --wait option to specify a delay between requests so you’re not hammering their site.
- Don’t republish or sell their content. It’s probably a copyright violation.
- Don’t crawl parts of that site that you don’t need, especially if they contain large files. If you’re using wget, you can use --exclude-directories /big-files to exclude files in the /big-files directory.
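The etiquette above maps onto a handful of wget options. The following is a minimal sketch rather than a recommended configuration: the function name polite_mirror, the two-second delay, the 200 KB/s rate cap, and the /big-files default are illustrative assumptions to tune per site. wget already honors robots.txt by default during recursive retrieval.

```shell
# polite_mirror URL [EXCLUDED_DIRS]
# Hypothetical helper: mirrors URL while pausing between requests,
# capping bandwidth, and skipping a comma-separated list of directories.
polite_mirror() {
    url=$1
    excluded=${2:-/big-files}   # assumed default, borrowed from the list above

    # --wait / --random-wait : pause between requests, with some jitter
    # --limit-rate           : cap the download bandwidth
    # --exclude-directories  : skip these directory trees entirely
    wget --mirror \
         --wait=2 --random-wait \
         --limit-rate=200k \
         --exclude-directories="$excluded" \
         "$url"
}

# Usage (not run here):
#   polite_mirror https://www.example.com/ /big-files,/downloads
```

Wrapping the options in a function keeps the delay and rate cap in one place, so they are not forgotten on later runs.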
How Website Authentication Works
Most website authentication works more or less the same way. Users enter their username and password into a login form’s text fields and click a “Login” button. The button press triggers an
HTTP POST request to the server with the username and password as data. In response, if the username and password are correct, the server response contains a
Set-Cookie header that includes an authentication cookie to be presented in subsequent website requests.
You can see this in action by using Firefox’s Web Developer tool (
Firefox > Tools > Web Developer > Network) or Chrome’s DevTools (
View > Developer > Developer Tools). When you view the network traffic, you should be able to see the
Set-Cookie header in the response headers when you log in, and a
Cookie header in the request headers of subsequent requests to the server.
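The same headers can be inspected from the command line. This is a small sketch, assuming a POSIX shell; the function names filter_set_cookie and show_cookie_headers are made up for this example. It relies on wget’s --server-response option, which prints the response headers to standard error.

```shell
# filter_set_cookie: reads response headers on stdin, keeps only the
# Set-Cookie lines, and strips the indentation wget puts in front of them.
filter_set_cookie() {
    sed -n 's/^[[:space:]]*\([Ss]et-[Cc]ookie:.*\)$/\1/p'
}

# show_cookie_headers URL
# Hypothetical helper: fetches URL and prints its Set-Cookie headers.
show_cookie_headers() {
    # Headers go to stderr, so fold stderr into stdout before filtering;
    # --output-document=/dev/null throws the page body away.
    wget --server-response --output-document=/dev/null "$1" 2>&1 |
        filter_set_cookie
}

# Usage (not run here):
#   show_cookie_headers https://www.example.com/
```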
wget To Login
First, you need to find the URL to post your credentials to. You can do this by viewing the source of the login page and looking for the login form’s
action, and the names of the username and password fields. For example, if the login form for
example.com looks like this:
<form action="/user/login" method="post">
  <input type="text" placeholder="Username" name="username">
  <input type="password" placeholder="Password" name="password">
</form>
If your username is email@example.com, and your password is password123, the wget command to log in would be:
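wget --save-cookies cookies.txt \
     --post-data 'username=email%40example.com&password=password123' \
     https://www.example.com/user/login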
You can also save the values in a file and use the
--post-file option, instead:
printf '%s' 'username=email%40example.com&password=password123' > login.txt
wget --save-cookies cookies.txt --post-file=login.txt \
     https://www.example.com/user/login
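Two details are easy to miss at the login step. By default, --save-cookies does not write cookies that have no expiry time (session cookies), and many sites issue exactly that kind of cookie at login; --keep-session-cookies saves them too. Also, posting from a file keeps the password out of the process list. The sketch below wraps both points in a helper; the name wget_login and its argument order are assumptions of this example.

```shell
# wget_login LOGIN_URL POST_FILE [COOKIE_FILE]
# Hypothetical helper: posts the already percent-encoded form body stored
# in POST_FILE and saves whatever cookies the server sets.
wget_login() {
    login_url=$1
    post_file=$2
    cookie_file=${3:-cookies.txt}

    # --keep-session-cookies : also save cookies that carry no expiry time
    # --delete-after         : discard the response page once it is fetched
    wget --save-cookies "$cookie_file" \
         --keep-session-cookies \
         --post-file="$post_file" \
         --delete-after \
         "$login_url"
}

# Usage (not run here):
#   chmod 600 login.txt    # the file holds a password
#   wget_login https://www.example.com/user/login login.txt
```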
Warning: in both cases, the values for both username and password have to be percent-encoded. For example, the password pas$w/rd would appear as pas%24w%2Frd.
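Percent-encoding by hand is error-prone, so a small helper can do it instead. The sketch below is plain POSIX shell; the function name urlencode is an assumption of this example (wget does not provide one), and it is only intended for ASCII input.

```shell
# urlencode STRING
# Percent-encodes every character except the RFC 3986 unreserved set
# (letters, digits, '.', '_', '~', '-'). Uses global variables, since
# POSIX sh has no 'local'.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character on its own
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            # "'$c" makes printf emit the character's numeric code
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}
```

For instance, `urlencode 'pas$w/rd'` prints `pas%24w%2Frd`, and `urlencode 'email@example.com'` prints `email%40example.com`.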
If your login was successful, your
cookies.txt file should contain the cookies that will be used for subsequent
GET requests. It should look something like this:
cat cookies.txt
.example.com  FALSE  /  FALSE  159325864  SESSIONID  99a2839f283b47
There may be additional cookies in there, as well. You don’t really need to know what these values mean, as
wget will take care of that for you. You can read more about the cookie format here, if you’re interested. Your values will all be different, including the cookie name.
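For the curious, the file uses the Netscape cookie format: one cookie per line, with seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry as a Unix timestamp, name, value). A quick way to see which cookies were saved is sketched below; list_cookies is a hypothetical helper name.

```shell
# list_cookies COOKIE_FILE
# Prints "name<TAB>domain<TAB>expiry" for each cookie in the file.
list_cookies() {
    awk -F '\t' '
        /^#/ || NF < 7 { next }          # skip comments, blank and short lines
        { print $6 "\t" $1 "\t" $5 }     # name, domain, expiry timestamp
    ' "$1"
}

# Usage (not run here):
#   list_cookies cookies.txt
```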
Presenting Your Authentication Cookies
Now that you have your authentication cookie(s), you should be able to use
wget to retrieve password-protected pages:
wget --load-cookies cookies.txt https://www.example.com/password-protected.html
You should now be able to mirror the entire site, including pages requiring authentication, like this:
wget --mirror --load-cookies cookies.txt https://www.example.com/
The cookies will eventually expire, so if you want to mirror that site again, you might have to repeat the login step.
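Whether a repeat login is needed can be checked before starting a long mirror run. The following is a sketch under a few assumptions: cookies_expired is a made-up helper name, the file is in the tab-separated Netscape format that wget writes, an expiry of 0 is treated as a session cookie with no fixed lifetime, and `date +%s` is available (it is on GNU and BSD systems, though POSIX does not require it). It cannot detect a session that the server has ended early.

```shell
# cookies_expired COOKIE_FILE [NOW]
# Succeeds (exit status 0) if any cookie has an expiry time at or before
# NOW, a Unix timestamp that defaults to the current time.
cookies_expired() {
    now=${2:-$(date +%s)}
    awk -F '\t' -v now="$now" '
        /^#/ || NF < 7 { next }                # skip comments and short lines
        $5 > 0 && $5 <= now { expired = 1 }    # field 5 is the expiry time
        END { exit (expired ? 0 : 1) }
    ' "$1"
}

# Usage (not run here):
#   if cookies_expired cookies.txt; then
#       echo "cookies have expired; repeat the login step" >&2
#   fi
```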
Don’t Let Yourself Get Logged Out
One gotcha that I write about elsewhere is that
wget can log you out by crawling certain links. For example, loading
example.com/user/logout may end your session.
wget will crawl that link like any other, and the cookies that you set when you logged in will be wiped out.
The actual URL used to log out will differ by site. You can discover the URL by either inspecting the source of a page that has a logout link, or by hovering over any logout link.
One way to handle this is to prevent wget from trying to crawl that page by using the --exclude-directories option:
wget --mirror --exclude-directories /user --load-cookies cookies.txt \
     https://www.example.com/
This will prevent
wget from crawling any links under
/user, including the logout link.
This technique should work for most sites, but not all.
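Where skipping a whole directory is too broad, for instance when other pages under the same path are needed, wget’s --reject-regex option can target just the logout URL. This is a sketch, assuming a wget recent enough to have the option (it was added around version 1.14, to the best of my knowledge); mirror_skip_logout and the default pattern are illustrative.

```shell
# mirror_skip_logout URL [LOGOUT_REGEX]
# Hypothetical helper: mirrors URL with saved cookies while refusing to
# fetch any URL that matches LOGOUT_REGEX.
mirror_skip_logout() {
    url=$1
    logout_regex=${2:-/user/logout}   # assumed pattern; adjust per site

    # --reject-regex is matched against each candidate URL
    wget --mirror \
         --load-cookies cookies.txt \
         --reject-regex "$logout_regex" \
         "$url"
}

# Usage (not run here):
#   mirror_skip_logout https://www.example.com/ '/user/logout'
```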
There are other ways that you can get logged out from the site besides crawling the logout link. For example, your session cookie may expire, or the website might terminate your session because you’ve exceeded a rate limit, or for some other reason.
There are also other options for
wget that come in handy when mirroring sites, such as
--adjust-extension. These are outside the scope of this article, and will be dealt with elsewhere.