In the first entry of the #dstexts series, I ditch old-timer RCurl for the new, shiny curl and talk about my five criteria for choosing R packages.
Today’s text message is from my good friend Pablo. Pablo is currently in the last months of his PhD in Survey Research at the University of Salamanca, Spain. I know him from my Erasmus year at the University of Essex where we were flatmates and both took classes in survey research. Originally an SPSS / Stata guy, he has been using R more and more over the last few years and I’ve been his personal “R guru”. Which is probably my dream job, tbh.
Anyway, to the text message (excuse the weird highlighting, still figuring that one out):
Pablo Cabrera Alvarez, [17.05.19 12:42]
Hi Frie
Pablo Cabrera Alvarez, [17.05.19 12:43]
I'm desperate with something I need your help
Pablo Cabrera Alvarez, [17.05.19 12:43]
😭😭😭😭
Frie, [17.05.19 12:44]
Oh no what
Frie, [17.05.19 12:44]
Is happening
Pablo Cabrera Alvarez, [17.05.19 12:45]
look, I have this webpage from which I want to download content: download.files() That's ok
[SOME UNHELPFUL BANTER FROM MY SIDE]
Pablo Cabrera Alvarez, [17.05.19 12:45]
My problem is that the webpage needs "authentication"
Frie, [17.05.19 12:45]
Oh OK
Pablo Cabrera Alvarez, [17.05.19 12:45]
I have the credentials
Frie, [17.05.19 12:45]
Yes
Frie, [17.05.19 12:45]
Ah
Frie, [17.05.19 12:45]
Mh
Pablo Cabrera Alvarez, [17.05.19 12:45]
I have tried with Rcurl
Frie, [17.05.19 12:45]
And?
Pablo Cabrera Alvarez, [17.05.19 12:46]
but it looks like the SSL protocol is different
Pablo Cabrera Alvarez, [17.05.19 12:46]
look, this si the error
Frie, [17.05.19 12:46]
Yes
Frie, [17.05.19 12:46]
Can you send me the command?
Frie, [17.05.19 12:46]
I have a bit time to look into it
Pablo Cabrera Alvarez, [17.05.19 12:46]
x <- getURL("https://THISWEBSITE/THISFILE.zip", userpwd="USER:PASSWORD6", httpauth = 4)
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
In summary, Pablo wanted to use R to download a zip file from the Internet. Of course, he could’ve just downloaded it manually via the browser and put it into his `data` directory. But doing this in code is actually nice because it increases reproducibility and at the same time documents where the data is coming from.
Usually you can achieve this in R by simply using `download.file`. However, when the file is in any way protected, things get a little more complicated. In this case, the file was protected with so-called “basic auth”. Basic authentication just means plain old username and password. If you have ever had an ugly-looking popup asking you for username and password, that was probably basic auth. In those cases, you often have to use a `curl` wrapper in R. `curl` is, broadly speaking, software for “transferring data in various protocols” (Wikipedia). It consists of a C library called `libcurl` and a command-line tool called `curl`.
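For an unprotected file, the plain `download.file` route really is a one-liner. The sketch below uses a local `file://` URL as a stand-in for a web address so that it runs without a network connection. The temporary file names are made up for illustration, and the path handling assumes a Unix-like system.

```r
# Create a small local file that stands in for the remote resource.
src <- tempfile(fileext = ".txt")
writeLines("hello from the 'server'", src)

# Download it to a second location, exactly as one would with an https:// URL.
dest <- tempfile(fileext = ".txt")
download.file(url = paste0("file://", src), destfile = dest, quiet = TRUE)

# The downloaded copy has the same content as the source.
readLines(dest)
```

`download.file` has no argument for a username or password, which is why a protected file calls for a different tool.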
Enough background info. Let’s get to how I solved it.
(If you want to skip the story, go straight to the solution.)
My initial reaction was: “oh boy, this looks nasty.” I had never seen any error like this before. I knew that a `tlsv1 alert protocol version` error was probably not coming from a simple mistake that would be easy to a) debug and b) fix. At least not for me.
What I did know was that the last time I personally had used the `RCurl` package had been in 2014. Since then, I had managed with just using `httr`. But I also remembered that there was a newer R package called `curl`.
In the end, my debugging strategy was:
1. Try the download with terminal `curl` to rule out server-side errors or errors at the system library level.
2. If terminal `curl` is successful, use the R package `curl`.
As this conversation happened right at the end of my lunch break (hi, boss, if you ever read this 👋) and I did not have much time left, I decided to skip 1) and go straight to 2).
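A quick look at the system library level is also possible from within R. The sketch below only reports which libcurl and TLS backend the `curl` package is linked against. It does not replace the terminal check from step 1, and what counts as "too old" for a given server is not something it can tell you.

```r
library(curl)

# curl_version() describes the libcurl build that the R package is using.
v <- curl_version()

v$version      # libcurl version string
v$ssl_version  # TLS backend and its version
```

If the TLS backend listed here is very old, a protocol version error from the server would at least be plausible.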
(Editing Frie: The following is how I think my process was. Maybe it was totally different?!?! Next time, I’ll screen-record.)
I installed the `curl` R package on my machine. Next up was probably googling “curl R package” which led me to its website. Right at the start is a summary of the most important functions:
curl_fetch_memory() saves response in memory
curl_download() or curl_fetch_disk() writes response to disk
curl() or curl_fetch_stream() streams response data
curl_fetch_multi() (Advanced) process responses via callback functions
It took me some minutes of not very careful reading to comprehend that what I needed was `curl_download`. After I had realized this, I headed back to RStudio and typed `?curl::curl_download` in the console to open the help.
From the Description:
Libcurl implementation of C_download (the “internal” download method) with added support for https, ftps, gzip, etc. Default behavior is identical to download.file, but request can be fully configured by passing a custom handle.
“fully configured” sounded good, so I had a look at the Usage section:
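One way to pull up that signature directly in the console is `args()`. This is a small sketch, and the defaults it prints come from whichever version of `curl` is installed.

```r
# Print the argument list of curl_download, including default values.
args(curl::curl_download)

# The argument names on their own.
arg_names <- names(formals(curl::curl_download))
arg_names
```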
From this, it was clear to me where I would need to insert the URL (`url`) and how I could specify the destination file (`destfile`). What was not so clear to me was how I could pass the username and password required for basic authentication. But by process of elimination, it became clear to me that it probably had to go into the `handle` argument:
- `url`: probably the URL we want to download from
- `destfile`: probably the file we want to write to
- `quiet`: no idea, but a boolean will not work for username/password. Plus, “quiet” has nothing to do with authentication
- `mode`: from looking at the default argument (`"wb"`), probably something with the file mode.
So, `handle` was the only one left. Plus, I vaguely remembered configuring so-called handle objects back when using `RCurl`.
What I had found out so far:
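Put as code, the state of knowledge at this point might look like the sketch below. `fetch_file` is a hypothetical wrapper name chosen for illustration. The handle is deliberately left for the caller to supply, because how to build one with credentials was still the open question.

```r
# Hypothetical wrapper: URL and destination are understood,
# the handle that carries the authentication is not yet.
fetch_file <- function(url, destfile, handle) {
  curl::curl_download(url = url, destfile = destfile, handle = handle)
}
```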
I went back to Firefox to find out more about the `handle`, specifically how to pass basic authentication details to it. Because I couldn’t find the needed information on the detailed project website just by skimming (why read carefully if you can just jump around?), I tried the project’s GitHub page. Still no luck, as the “Hello World” examples only covered setting HTTP request headers but not authentication. So finally, I took the time to read the package website more carefully and, lo and behold, there was a section on “Configuring a handle”.
Creating a new handle is done using new_handle. After creating a handle object, we can set the libcurl options and http request headers.
Use the curl_options() function to get a list of the options supported by your version of libcurl. The libcurl documentation explains what each option does. Option names are not case sensitive.
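To make the quoted description a bit more concrete, here is a minimal sketch of creating a handle and then setting one libcurl option and one request header on it. The user agent string and the `Accept` header are arbitrary values picked for illustration and have nothing to do with Pablo's server.

```r
library(curl)

# A fresh handle with default settings.
h <- new_handle()

# libcurl options are set by name...
handle_setopt(h, useragent = "my-r-script", followlocation = TRUE)

# ...and HTTP request headers have their own setter.
handle_setheaders(h, "Accept" = "application/zip")
```

Both setters modify the handle in place, so `h` can be passed on to a download function afterwards.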
“Curl options” sounded good: Over the course of the last 1.5 years, I have written a lot of `curl` requests in the terminal, e.g. to do quick checks on databases. From this experience, I know that there are command line options for setting basic authentication in the terminal `curl` command, so there should be underlying `libcurl` equivalents because after all, terminal `curl` relies on `libcurl`. Does this even make sense?
Anyway, I got the options:
[1] 247
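A count like this can be produced by wrapping the call in `length()`. This is a minimal sketch. The number depends on the libcurl version installed, so it may well differ from the output above.

```r
# curl_options() returns a named vector with one entry per supported option.
n_options <- length(curl::curl_options())
n_options
```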
Of course, I entered `curl::curl_options()` to see all the options. But because there are quite a lot and I want to save you from endlessly scrolling, I have added the `length` for the purpose of this blog post. Getting all options printed out is left as an exercise to the reader. 😉 Because I didn’t have time to read all those 247 options, I decided to take the Google route again and try to find the name of the option on the Internet:
Nice! Especially the `CURLOPT_USERPWD` immediately appealed to me because in his original `RCurl` command, Pablo had a `userpwd` argument as well. Without even checking the links, I headed back to R to find out whether there were any options matching those I found:
use_ssl useragent username userpwd
119 10018 10173 10005
verbose wildcardmatch writedata writefunction
41 197 10001 20011
xferinfofunction xoauth2_bearer
20219 10220
Bingo for `userpwd`!
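Instead of scanning the printed list by eye, the option names can also be searched. The following is a sketch with base R pattern matching. The search term `"user"` is just my guess at a useful pattern, not necessarily what was typed at the time.

```r
# All options supported by the installed libcurl, as a named vector.
opts <- curl::curl_options()

# Keep only the options whose name contains "user".
matches <- opts[grepl("user", names(opts))]
names(matches)
```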
Now I was ready to set up my handle. From the package website, I knew that setting options was done with `curl::handle_setopt`:
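A minimal sketch of that setup, reusing the placeholder URL and credentials from Pablo's message. The helper name `make_auth_handle` and the destination path are assumptions for illustration, and the download itself is commented out because the URL is not a real one.

```r
library(curl)

# Hypothetical helper: builds a handle that carries basic-auth credentials.
make_auth_handle <- function(user, password) {
  h <- new_handle()
  # libcurl expects the credentials as a single "user:password" string.
  handle_setopt(h, userpwd = paste0(user, ":", password))
  h
}

h <- make_auth_handle("USER", "PASSWORD")

# Not run here, because the URL is a placeholder:
# curl_download("https://THISWEBSITE/THISFILE.zip",
#               destfile = "data/THISFILE.zip",
#               handle = h)
```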
I crossed my fingers and executed the command. And it just worked - not something that usually happens to me. I saved the code in a file and sent it to Pablo, still not sure it’d work on his computer as well. But it did! How cool!
Frie, [17.05.19 13:01]
well does it work for starters? ;)
Frie, [17.05.19 13:01]
(as it depends on system library, could also not work on your machine)
Pablo Cabrera Alvarez, [17.05.19 13:01]
I owe you more than one dinner, believe me
Pablo Cabrera Alvarez, [17.05.19 13:01]
yes yes, I just tried
Pablo Cabrera Alvarez, [17.05.19 13:02]
it's perfect
After approximately 15 minutes, issue solved. 💪
However, there was still an open question:
Pablo Cabrera Alvarez, [17.05.19 13:01]
how did you know?? I have been three hours visiting forums and stuff
By that time, I really had to get back to work so my answer was a bit curt. But it’s a good question that points to the importance of what I like to call “non-technical knowledge”. What I mean by this is having the knowledge to answer questions like:
Of course, technical skills help with answering those questions but it is not quite the same.
While I could talk about each of those questions for ages, let’s focus on the first two for the moment: How did I know about the `curl` package and why did I prefer it over `RCurl`?
For me personally, the answer to the first question boils down to keeping up with the latest developments in R. I use Twitter for that purpose because the R community is quite active there (under the hashtag #rstats, not #R!) and I follow many many R users and developers. For all people who do not want to ruin their phone usage statistics, Maëlle Salmon has written a good blog post on “Keeping up to date with R news”. Among her recommendations are mailing lists, news aggregators like R-Bloggers or R Weekly, attending meetups and conferences and much more.
As for the second question, “do I use package x or y?”, I think the following “rules” feed into my decision:
Those “rules” are roughly in order of importance although I guess the order and relative importance of them differs depending on the specific case. Sometimes, there is a “popular” package as measured by the number of package downloads but it is just popular because it has been around forever. Sometimes, people with a lot of followers on Twitter produce shitty packages. And sometimes, although very rarely nowadays, those “rules” just fail and I end up using a package with 10 downloads from 5 years ago. ¯\_(ツ)_/¯
For the `RCurl` vs `curl` case described above, it was a combination of 4. and 5. I knew from Twitter that there was a new package for curl operations from Jeroen. I had also heard a lot of praise for his work, which I could only agree with after having used his `openssl` and `jose` packages for developing `sealr`. The `curl` package also had a nice project website + GitHub Readme and it was easy for me to check that Jeroen was still actively working on the package. In contrast, as I mentioned above, I had not used `RCurl` since 2014 and it does not have a nice GitHub repository, only an old-school looking website that I actually only found after checking again for this blog post (nothing against old school but yeah).
Update 2019-05-22, 20:37: After posting about this post on Twitter, Jeroen was kind enough to quote-tweet my tweet, confirming my suspicion about `RCurl` being an outdated package:
I'm probably biased, but imo all #rstats users should make the switch to ‘curl’/‘httr’ asap. The old ‘RCurl’ pkg has been unmaintained for years and is broken beyond repair. It's unfortunate this is unknown to many new users. https://t.co/kmlWTicYHN
— Jeroen Ooms (@opencpu) May 22, 2019
So we can add “rule” number 3. (which I updated as well to emphasize the security reasons) to the list.
Finally, that error looked really nasty and I just didn’t want to have that on my screen. 😂
Well, this escalated into quite a long post. Let me know on Twitter if I should try to keep it shorter or whether this is fine.
I still hope it was interesting for you and you could take something away from this – and if this “something” is that I probably spend too much time on Twitter…you’re right.
Until next time: keep coding. ❤️
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://gitlab.com/friep/blog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".