Optimize web scraping with RSelenium

Question

I am scraping a dynamic webpage and would like to speed up the process, since it is very slow. The page displays a series of sales with some information, and more sales appear as one scrolls down, although the number of sales is finite. To avoid scrolling, I increased the window size so that almost every sale loads at once. However, this takes a while because there is a lot of information, including images. The information I am extracting is the price, the asset name, and the link associated with the asset (the one you follow when you click on the image).

My goal is to optimize this process as much as possible. One way would be to not load the images, since I don't need them, but I could not find a way to do that with Firefox.

Any improvement would be greatly appreciated.

library(RSelenium)
library(rvest)

url <- "https://cnft.io/marketplace?project=Boss%20Cat%20Rocket%20Club&sort=_id:-1&type=listing,offer"

exCap <- list("moz:firefoxOptions" = list(args = list('--headless'))) # Hide browser --headless
rD <- rsDriver(browser = "firefox", port = as.integer(sample(4000:4700, 1)),
               verbose = FALSE, extraCapabilities = exCap)
remDr <- rD[["client"]]
remDr$setWindowSize(30000, 30000)
remDr$navigate(url)
Sys.sleep(300)
html <- remDr$getPageSource()[[1]]
remDr$close()

html <- read_html(html)

Dave2e · Accepted Answer · 2022-10-07 23:15:29Z

Well, after some digging through that website, I found an API for all the listings: https://api.cnft.io/market/listings. It takes a POST request and returns paginated JSON. We can use httr to send such requests.
Here is a small script for your web scraping task.

api_link <- "https://api.cnft.io/market/listings"
project <- "Boss Cat Rocket Club"

# Request one page of listings and parse the JSON response.
query <- function(page, url, project) {
  httr::content(httr::POST(
    url = url,
    body = list(
      search = "",
      types = c("listing", "offer"),
      project = project,
      sort = list(`_id` = -1L),
      priceMin = NULL,
      priceMax = NULL,
      page = page,
      verified = TRUE,
      nsfw = FALSE,
      sold = FALSE,
      smartContract = FALSE
    ),
    encode = "json"  # send the body as JSON
  ), simplifyVector = TRUE)
}

# Fetch pages one by one until a page comes back empty.
query_all <- function(url, project) {
  # `count` from the first response serves as an upper bound on the number of pages.
  n <- query(1L, url, project)[["count"]]
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- query(i, url, project)[["results"]]
    # Stop at the first empty page and drop the unused slots.
    if (length(out[[i]]) < 1L)
      return(out[seq_len(i - 1L)])
  }
  out
}

# Keep only the asset name, the price, and the link to the listing.
collect_data <- function(results) {
  dplyr::tibble(
    asset_id = results[["asset"]][["assetId"]],
    price = results[["price"]],
    link = paste0("https://cnft.io/token/", results[["_id"]])
  )
}

system.time(
  dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()  
)
dt
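
The stopping rule in `query_all()` can be checked without touching the network. The sketch below is a hedged variant, not the code above: it takes the page fetcher as an argument so that a fake three-page source can stand in for the real API, and it grows the result list until the first empty page instead of preallocating. The names `collect_pages`, `fetch_page`, `fake_data` and `fake_fetch` are hypothetical and exist only for this illustration.

```r
# Sketch only: `collect_pages`, `fetch_page` and `fake_fetch` are hypothetical names.
collect_pages <- function(fetch_page, max_pages = 1000L) {
  out <- list()
  for (i in seq_len(max_pages)) {
    res <- fetch_page(i)
    # An empty page marks the end of the listings.
    if (NROW(res) < 1L) break
    out[[length(out) + 1L]] <- res
  }
  out
}

# Fake source: pages 1 and 2 hold two rows each, page 3 holds one, later pages are empty.
fake_data <- data.frame(id = 1:5, price = c(10, 20, 30, 40, 50))
fake_fetch <- function(page, page_size = 2L) {
  start <- (page - 1L) * page_size + 1L
  if (start > nrow(fake_data)) return(fake_data[0, ])
  fake_data[start:min(start + page_size - 1L, nrow(fake_data)), ]
}

pages <- collect_pages(fake_fetch)
all_rows <- do.call(rbind, pages)
```

With the real API, the fetcher would presumably be `function(i) query(i, api_link, project)[["results"]]`. The timing reported next is for the original script as posted, not for this variant.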

Output (it takes about 12 seconds to finish)

> system.time(
+   dt <- query_all(api_link, project) |> lapply(collect_data) |> dplyr::bind_rows()  
+ )
   user  system elapsed 
   0.78    0.00   12.33 
> dt
# A tibble: 2,161 x 3
   asset_id                     price link                                          
   <chr>                        <dbl> <chr>                                         
 1 BossCatRocketClub1373    222000000 https://cnft.io/token/61ce22eb4185f57d50190079
 2 BossCatRocketClub4639    380000000 https://cnft.io/token/61ce229b9163f2db80db98fe
 3 BossCatRocketClub5598    505000000 https://cnft.io/token/61ce22954185f57d5018e2ff
 4 BossCatRocketClub2673    187000000 https://cnft.io/token/61ce2281ceed93ea12ae32ec
 5 BossCatRocketClub1721    350000000 https://cnft.io/token/61ce2281398627cc52c5844c
 6 BossCatRocketClub673     300000000 https://cnft.io/token/61ce22724185f57d5018d645
 7 BossCatRocketClub5915 200000000000 https://cnft.io/token/61ce2241398627cc52c56eae
 8 BossCatRocketClub5699    350000000 https://cnft.io/token/61ce21fa398627cc52c55644
 9 BossCatRocketClub4570    350000000 https://cnft.io/token/61ce21ef4185f57d5018a9d4
10 BossCatRocketClub6125    250000000 https://cnft.io/token/61ce21e49163f2db80db58dd
# ... with 2,151 more rows
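
The prices in this output are large integers. Assuming they are quoted in lovelace (1 ADA = 1,000,000 lovelace), which the answer does not state, a column in ADA could be added as in the sketch below. The three rows of `sample_dt` are copied by hand from the output above so the example runs offline; `sample_dt`, `LOVELACE_PER_ADA` and `price_ada` are hypothetical names.

```r
# Assumption: `price` is in lovelace, with 1 ADA = 1e6 lovelace.
LOVELACE_PER_ADA <- 1e6

# First three rows of the output above, rebuilt by hand so the example runs offline.
sample_dt <- data.frame(
  asset_id = c("BossCatRocketClub1373", "BossCatRocketClub4639", "BossCatRocketClub5598"),
  price = c(222000000, 380000000, 505000000),
  stringsAsFactors = FALSE
)

# Convert and list the assets from cheapest to most expensive.
sample_dt$price_ada <- sample_dt$price / LOVELACE_PER_ADA
sample_dt[order(sample_dt$price_ada), c("asset_id", "price_ada")]
```

The same division would apply to the full `dt` if the unit assumption holds.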

edited Oct 7, 2022 at 23:15 · Dave2e

answered Dec 30, 2021 at 21:25 · ekoam

5 Comments

mat Over a year ago

Wow! Thank you so much, this is amazing! I have been looking for the API for quite some time but couldn't find it. How did you find it?

ekoam Over a year ago

@mat Inspect the network activity using the developer tools in Chrome/Firefox/Edge. See this.

mat Over a year ago

Thanks a bunch, I'll have a look at it! By any chance, could you quickly figure out whether this webpage (jpg.store) also has a 'hidden' API? I am performing the same web scraping on that website.

ekoam Over a year ago

Sorry @mat, this follow-up request is beyond the scope of this post. Also, it is not really an R-related question.

mat Over a year ago

Totally understand. Thanks again for your help!
