Scrape multiple linked HTML tables in R with rvest

Question

This article (http://www.ajnr.org/content/30/7/1402.full) contains four links to HTML tables, which I would like to scrape with rvest.

With the help of the CSS selector:

"#T1 a"

it's possible to get to the first table like this:

library("rvest")
html_session("http://www.ajnr.org/content/30/7/1402.full") %>%
  follow_link(css = "#T1 a") %>%
  html_table() %>%
  View()

The CSS selector:

".table-inline li:nth-child(1) a"

makes it possible to select all four HTML nodes containing the tags that link to the four tables:

library("rvest")
read_html("http://www.ajnr.org/content/30/7/1402.full") %>%
  html_nodes(css = ".table-inline li:nth-child(1) a")

How would it be possible to loop through this list and retrieve all four tables in one go? What's the best approach?

Maybe this helps you out: stackoverflow.com/questions/1395528/…
– user1267127, Commented Feb 25, 2015 at 21:46

hadley · Accepted Answer · 2015-09-30 13:17:37Z

19

Here's one approach:

library(rvest)

url <- "http://www.ajnr.org/content/30/7/1402.full"
page <- read_html(url)

# First find all the urls
table_urls <- page %>% 
  html_nodes(".table-inline li:nth-child(1) a") %>%
  html_attr("href") %>%
  xml2::url_absolute(url)

# Then loop over the urls, downloading & extracting the table
lapply(table_urls, . %>% read_html() %>% html_table())
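The same two steps can be wrapped in a function so that they work for other articles with the same page layout. This is only a sketch: `table_links()` and `scrape_linked_tables()` are hypothetical helper names, the default selector is the one from the question, and the HTML fragment and `example.org` URL at the end are made up so that the link-extraction step can be checked without a network connection.

```r
library(rvest)

# Hypothetical helper: collect the absolute URLs of the linked tables
# found on an already parsed page.
table_links <- function(page, base_url,
                        selector = ".table-inline li:nth-child(1) a") {
  hrefs <- page %>%
    html_nodes(selector) %>%
    html_attr("href")
  # Drop anchors without an href, resolve the rest against the article URL
  xml2::url_absolute(hrefs[!is.na(hrefs)], base_url)
}

# Hypothetical helper: download and parse every linked table of one article.
# Extra arguments (e.g. a different selector) are passed on to table_links().
scrape_linked_tables <- function(article_url, ...) {
  urls <- table_links(read_html(article_url), article_url, ...)
  tables <- lapply(urls, function(u) html_table(read_html(u)))
  names(tables) <- urls
  tables
}

# Offline check of the link-extraction step on a made-up fragment
demo <- read_html('
  <html><body>
    <ul class="table-inline">
      <li><a href="T1.expansion.html">View table</a></li>
      <li><a href="T1.ppt">Slide</a></li>
    </ul>
    <ul class="table-inline">
      <li><a href="T2.expansion.html">View table</a></li>
    </ul>
  </body></html>')
demo_urls <- table_links(demo, "http://example.org/content/1/2/3/")
print(demo_urls)
```

For the article in the question, `scrape_linked_tables(url)` would then return one list entry per linked page. Each entry is itself a list, because `html_table()` applied to a whole document returns every table it finds.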

edited Sep 30, 2015 at 13:17

answered Feb 26, 2015 at 12:23

hadley


3 Comments

Paul M Over a year ago

I tried this, and I got an error message: Warning message: 'html' is deprecated. Use 'read_html' instead. See help("Deprecated")

Paul M Over a year ago

I changed this by substituting read_html(url) on the third line. I am still getting complaints. What am I doing wrong?

hadley Over a year ago

@PaulM you probably missed the one on the last line

user1267127 · Answer · 2015-02-25 22:06:40Z

1

You might want to use something like the following:

library(XML)  # provides readHTMLTable()

main_url <- "http://www.ajnr.org/content/30/7/1402/"
urls <- paste0(main_url, c("T1", "T2", "T3", "T4"), ".expansion.html")
tables <- list()
for (i in seq_along(urls)) {
  # Parse every table on the page, then keep the one with the most rows
  total <- readHTMLTable(urls[i])
  n.rows <- unlist(lapply(total, function(tbl) dim(tbl)[1]))
  tables[[i]] <- as.data.frame(total[[which.max(n.rows)]])
}
tables

#[[1]]
#  Glioma Grade Sensitivity Specificity    PPV    NPV
#1    II vs III       50.0%       92.9%  80.0%  76.5%
#2     II vs IV      100.0%      100.0% 100.0% 100.0%
#3    III vs IV       78.9%       87.5%  93.8%  63.6%

#[[2]]
#  Glioma Grade Sensitivity Specificity   PPV    NPV
#1    II vs III       87.5%       71.4% 63.6%  90.9%
#2     II vs IV      100.0%       85.7% 90.5% 100.0%
#3    III vs IV       89.5%       75.0% 89.5%  75.0%

#[[3]]
#  Criterion Sensitivity Specificity    PPV   NPV
#1       ≥1*       85.2%       92.9%  95.8% 76.5%
#2        ≥2       81.5%      100.0% 100.0% 73.7%

#[[4]]
#  Criterion Sensitivity Specificity   PPV   NPV
#1     <1.92       96.3%       71.4% 86.7% 90.9%
#2     <2.02       92.6%       71.4% 86.2% 83.3%
#3    <2.12*       92.6%       85.7% 92.6% 85.7%
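The "keep the table with the most rows" step does not depend on the XML package. As a rough sketch, the same selection can be done with rvest's `html_table()`. The page below is a made-up fragment (a one-row layout table followed by a two-row data table), not the journal's actual markup.

```r
library(rvest)

# Made-up page: a small layout table and a larger data table
page <- read_html('
  <html><body>
    <table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
    <table>
      <tr><th>Criterion</th><th>Sensitivity</th></tr>
      <tr><td>x</td><td>high</td></tr>
      <tr><td>y</td><td>low</td></tr>
    </table>
  </body></html>')

# html_table() on a document returns one data frame per <table>
all_tables <- html_table(page)

# Count the rows of each table and keep the largest one
n_rows <- vapply(all_tables, nrow, integer(1))
largest <- as.data.frame(all_tables[[which.max(n_rows)]])
print(largest)
```

Picking the largest table is a heuristic: it assumes that any other tables on the page are smaller layout or navigation tables.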

answered Feb 25, 2015 at 22:06

user1267127

1 Comment

landge Over a year ago

Thank you! Would it be possible to create a more generic approach where the script extracts the number of tables (or links)? This could then be used for other articles of the same journal as well.
