Scripting browser-like tasks

curl can do almost every HTTP operation and transfer your favorite browser can. It can actually do a lot more than so as well, but in this chapter we focus on the fact that you can use curl to reproduce, or script, what you would otherwise have to do manually with a browser.

Here are some tricks and advice on how to proceed when doing this.

Figure out what the browser does

This is really a necessary first step. Second-guessing what it does risks having you chase down the wrong problem rat-hole. The scientific approach to this problem pretty much requires that you first understand what the browser does.

To learn what the browser does to perform a certain task, you can either read the HTML pages that you operate on and with a deep enough knowledge you can see what a browser would do to accomplish it and then start trying to do the same with curl.

The slightly more effective way, that also works even for the cases when the page is shock-full of obfuscated JavaScript, is to run the browser and monitor what HTTP operations it performs.

The Copy as curl section describes how you can record a browser's request and easily convert that to a curl command line.

Those copied curl command lines are often not good enough though since they tend to copy exactly that request, while you probably want to be a tad bit more dynamic so that you can reproduce the same operation and not resend the verbatim request.

Cookies

A lot of the web today works with a username and password login prompt somewhere. In many cases you even logged in a while ago with your browser but it has kept the state and keeps you logged in.

The logged-in state is almost always done by using cookies. A common operation would be to first login and save the returned cookies in a file, and then let the site update the cookies in the subsequent command lines when you traverse the site with curl.

Web logins and sessions

The site at https://example.com/ features a login prompt. The login on the web site is an HTML form to which you send a HTTP POST to. Save the response cookies and the response (HTML) output.

Although the login page is visible (if you would use a browser) on https://example.com/, the HTML form tag on that page informs you about which exact URL to send the POST to, using the action parameter.

In our imaginary case, the form tag looks like this:

<form action="login.cgi" method="POST">
  <input type="text" name="user">
  <input type="password" name="secret">
  <input type="hidden" name="id" value="bc76">
</form>

There are three fields of importance. user, secret and id. The last one, the id, is marked hidden which means that it does not show up in the browser and it is not a field that a user fills in. It is generated by the site itself, and for your curl login to succeed, you need extract that value and use that in your POST submission together with the rest of the data.

Send correct contents to the fields to the correct destination URL:

curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi -o out

Many login pages even send you a session cookie already when presenting the login, and since you often need to extract the hidden fields from the <form> tag anyway, you could do something like this first:

curl -c cookies https://example.com/ -o loginform

You would often need an HTML parser or some scripting language to extract the id field from there and then you can proceed and login as mentioned above, but with the added cookie loading (I am splitting the line into two lines to make it more readable):

curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi -b cookies -c cookies -o out

You can see that it uses both -b for reading cookies from the file and -c to store cookies again, for when the server sends back updated cookies.

Always, always, add -v to the command lines when working out the details. See also the verbose section for more details on that.

Redirects

It is common for servers to use redirects when responding to a login POST. It is so common I would probably say it is rare that it is not solved with a redirect.

You then need to remember that curl does not follow redirects automatically. You need to instruct it to do this by adding the -L command line option. Adding that to the previous command line then makes the full one look like:

curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi -b cookies -c cookies -L -o out

In the above example command lines, we save the login response output in a file named 'out' and in your script you should probably verify that it contains some text or something that confirms that the login is successful.

Once successfully logged in, get the files or perform the HTTP operations you need and remember to keep using both -b and -c on the command lines to use and update the cookies.

Referer

Some sites verify that the Referer: is actually identifying the legitimate parent URL when you request something or when you login or similar. You can then inform the server from which URL you arrived by using -e https://example.com/ etc. Appending that to the previous login attempt then makes it:

curl -d user=daniel -d secret=qwerty -d id=bc76 \
  https://example.com/login.cgi \
  -b cookies -c cookies -L -e "https://example.com/" -o out

TLS fingerprinting

Anti-bot detections nowadays use TLS fingerprinting to figure out whether a request is coming from a browser. curl's fingerprint can vary depending on your environment and most likely is different from those of browsers. curl's CLI does not have options to change all the various parts of the fingerprint, however an advanced user can customize the fingerprint through the use of libcurl and by compiling curl from source themselves.

everything curl