Download pdf from javascript onclick attribute using R

Question

I would like to download a PDF from this website using R. The problem is that you first have to click the "Maak een pdf" ("Create a PDF") button on the website, which is driven by a JavaScript onclick attribute. I'm able to find the attribute, but I have no idea how to download the PDF file. Here is a screenshot of the element inspection:

Here is the code I tried:

library(tidyverse)
library(rvest)

link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"

button <- link %>%
  read_html() %>%
  html_nodes(".download-als") %>%
  html_nodes("a") %>%
  html_attr("href") 
button
#> [1] "javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00\", \"\", true, \"\", \"\", false, true))"

download.file(button, destfile = "Downloads/test.pdf")
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): URL
#> javascript:WebForm_DoPostBackWithOptions(new
#> WebForm_PostBackOptions("ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00", "",
#> true, "", "", false, true)): cannot open destfile 'Downloads/test.pdf', reason
#> 'No such file or directory'
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): download had
#> nonzero exit status

Created on 2024-02-05 with reprex v2.0.2

I tried to fetch the file with download.file(), but of course that doesn't work. It seems that we need RSelenium to perform a click on the button via a browser. I found this question: How to web-scrape on-click information with R? but I can't find a way to do this with an "onclick" attribute. Does anyone know how to download a PDF file from an onclick attribute?

Hi @KJ, thanks for your comment! But how do you create that link without manually clicking on the button on the website? I'm not completely sure what you mean by set of instructions.
– Quinten, Commented Feb 5, 2024 at 12:35
@Quinten You can see my deleted answer (I do not have the time to correct the mess-up, hence it's deleted). Just inspect the Network calls.
– MendelG, Commented Feb 5, 2024 at 12:40
Hi @MendelG, thanks for your suggestion! I don't understand how you get the JSON result as output. Where can we find that to get the URL you mention?
– Quinten, Commented Feb 5, 2024 at 13:14
@Quinten You need to inspect the network calls in your browser: open the dev tools (usually by pressing F12) and navigate to the "Network" tab.
– MendelG, Commented Feb 5, 2024 at 13:17

margusl · Accepted Answer · 2024-02-06 09:54:50Z

2

To get from the document page to the final download link, we need to play some request/response ping-pong to mimic the JavaScript application: first we submit a request to the backend, then wait for it to finish, and then continue with the download.

To recover that exact flow and the endpoint used (/PUC/Handlers/ManifestatieService.ashx), we should focus on the Network tab of the browser's dev tools (activate it before clicking through the download process to record all relevant requests and responses). If there's too much traffic, search and filter can be quite handy.

To implement a flow that's close enough, we'll mostly rely on httr2; rvest is only used to extract the JavaScript function parameters from the link's onclick attribute. In this particular case, though, we could probably also extract the identifier (PUC_746615_17) and the kanaal value (natuurvergunningen) directly from the document URL.
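A minimal sketch of that URL-based shortcut, in base R. It assumes that every document URL follows the pattern `/<kanaal>/doc/<id>/<version>/` seen in this one example, and that the service identifier is the document id joined to the version with an underscore (`PUC_746615_17` and `1` giving `PUC_746615_17_1`, which matches the value found in the onclick attribute of this page). `params_from_url()` is a hypothetical helper, and the `pdf` format is hard-coded rather than read from the page.

```r
# Hypothetical helper: derive the request parameters from a document URL
# instead of parsing the onclick attribute.
# Assumes URLs of the form https://puc.overheid.nl/<kanaal>/doc/<id>/<version>/
params_from_url <- function(link, soort = "pdf") {
  # drop scheme and host, then split the remaining path into its segments
  path  <- sub("^https?://[^/]+/", "", link)
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  # fail early if the URL does not look like the assumed pattern
  stopifnot(length(parts) >= 4, parts[2] == "doc")
  list(
    identifier = paste(parts[3], parts[4], sep = "_"),
    kanaal     = parts[1],
    soort      = soort
  )
}

params <- params_from_url(
  "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"
)
str(params)
```

Since this skips the page request entirely, it is worth checking the result against the onclick values for a few documents before relying on it. The code that follows takes the onclick route.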

library(tidyverse)
library(rvest)
library(httr2)

# timestamp helper
timestamp_ <- \() sprintf("%.0f", as.numeric(Sys.time()) * 1000)

# get request parameters --------------------------------------------------
link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"

onclick <- 
  link %>%
  read_html() %>%
  html_elements(".download-als a") %>% 
  html_attr("onclick")

(req_param <- str_extract_all(onclick, "(?<=')[^\\s']+(?=')")[[1]])
#> [1] "PUC_746615_17_1"    "natuurvergunningen" "pdf"

# submit request / get ticket ---------------------------------------------
ticket <- 
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
  req_url_query(actie      = "maakmanifestatie",
                kanaal     = req_param[2],
                identifier = req_param[1],
                soort      = req_param[3],
                `_`        = timestamp_()) %>% 
  req_perform() %>% 
  resp_body_json(check_type = FALSE)

jsonlite::toJSON(ticket, auto_unbox = TRUE,  pretty = TRUE)
#> {
#>   "ticket": "70337706-d27d-463e-8b6b-8ca2ba47662d"
#> }

# submit ticket / get url -------------------------------------------------
# it takes a few moments for the backend to finish our request
Sys.sleep(2)
pdf_url <- 
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
  req_url_query(actie      = "haalstatus",
                ticket     = ticket$ticket,
                `_`        = timestamp_()) %>% 
  req_perform() %>% 
  resp_body_json(check_type = FALSE)

jsonlite::toJSON(pdf_url, auto_unbox = TRUE,  pretty = TRUE)
#> {
#>   "result": {
#>     "status": "available",
#>     "url": "/puc-opendata/request-result/70337706-d27d-463e-8b6b-8ca2ba47662d/Verlenging%20van%20de%20looptijd%20van%20de%20vergunning%20Wet%20Natuurbescherming%20%28Wnb%29%20voor%20het%20project%20Afsluitdij.pdf",
#>     "filename": "Verlenging van de looptijd van de vergunning Wet Natuurbescherming (Wnb) voor het project Afsluitdij.pdf"
#>   }
#> }

# download pdf ------------------------------------------------------------
request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
  req_url_query(actie = "download",
                identifier = req_param[1],
                url = pdf_url$result$url,
                filename = pdf_url$result$filename) %>% 
  req_perform(path = pdf_url$result$filename)
#> <httr2_response>
#> GET
#> https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx?actie=download&identifier=PUC_746615_17_1&url=%2Fpuc-opendata%2Frequest-result%2F70337706-d27d-463e-8b6b-8ca2ba47662d%2FVerlenging%2520van%2520de%2520looptijd%2520van%2520de%2520vergunning%2520Wet%2520Natuurbescherming%2520%2528Wnb%2529%2520voor%2520het%2520project%2520Afsluitdij.pdf&filename=Verlenging%20van%20de%20looptijd%20van%20de%20vergunning%20Wet%20Natuurbescherming%20%28Wnb%29%20voor%20het%20project%20Afsluitdij.pdf
#> Status: 200 OK
#> Content-Type: application/pdf
#> Body: On disk 'body'

fs::file_info(pdf_url$result$filename)[1:3]
#> # A tibble: 1 × 3
#>   path                                                               type   size
#>   <fs::path>                                                         <fct> <fs:>
#> 1 …nning Wet Natuurbescherming (Wnb) voor het project Afsluitdij.pdf file   171K

Created on 2024-02-05 with reprex v2.0.2
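The fixed `Sys.sleep(2)` in the code above assumes the backend always finishes within two seconds. A more defensive variant, sketched here, polls the status until it reads `available`. `wait_for_pdf()` and `fetch_status` are hypothetical names. The only status value known from the output above is `"available"`, so treating every other value as "not ready yet" is an assumption about the service, and the `"pending"` value in the stub is made up for the demonstration.

```r
# Hypothetical helper: call fetch_status() repeatedly until the backend reports
# that the document is ready, or give up after max_tries attempts.
# fetch_status is any zero-argument function returning the parsed haalstatus
# response, i.e. a list shaped like list(result = list(status = ..., url = ...)).
wait_for_pdf <- function(fetch_status, max_tries = 10, delay = 1) {
  for (i in seq_len(max_tries)) {
    res <- fetch_status()
    # "available" is the only status seen in the output above; anything else
    # is assumed to mean "still being prepared"
    if (identical(res$result$status, "available")) return(res)
    Sys.sleep(delay)
  }
  stop("PDF not available after ", max_tries, " attempts")
}

# Offline demonstration with a stub that becomes available on the third call;
# "pending" is an invented placeholder status, not a documented value
calls <- 0
stub <- function() {
  calls <<- calls + 1
  list(result = list(status = if (calls < 3) "pending" else "available"))
}
res <- wait_for_pdf(stub, delay = 0)
```

In the real flow, `fetch_status` would be a function that wraps the `haalstatus` request from the code above and returns its `resp_body_json()` result, so that `pdf_url <- wait_for_pdf(fetch_status)` replaces the fixed sleep.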

Alternative approaches would rely on tools that can handle JavaScript, such as chromote or RSelenium, or perhaps webdriver with PhantomJS.

edited Feb 6, 2024 at 9:54

answered Feb 5, 2024 at 18:59


7 Comments

Quinten Over a year ago

Thank you @margusl for your great answer! I am not sure how you got this specific "https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx" URL for the request. Is it possible to do this without manually adding it to the request? I have to scrape more documents, so adding this URL manually is not very practical.

margusl Over a year ago

Updated with a screenshot and some explanation. It's basically the same approach described by @MendelG: recover the requests through the Network tab. /PUC/Handlers/ManifestatieService.ashx is a service endpoint and it's highly unlikely that it will change; you just call it with different actions (?actie=...) and arguments.

margusl Over a year ago

And to clarify, this is a bit of a special case: there's no static download link for the PDF document, so you have to go through multiple requests and work with intermediate results. You first need to ask the backend service to prepare a download for you, which is then available only for a limited time.

margusl Over a year ago

And technically you can extract the endpoints used from the page source as well: if you search for manifestatieServiceUrl or manifestatieDownloadServiceUrl, you'll find the relevant JavaScript block.

Quinten Over a year ago

@margusl, thanks for your great answer! I just asked a follow up question.
