Scraping html tables into R data frames using the XML package

Question

How do I scrape HTML tables using the XML package?

Take, for example, this Wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?

To work out the XPath selectors, check out selectorgadget.com/ - it's awesome.
– hadley, Commented Sep 9, 2009 at 22:55

Jim G. · Accepted Answer · 2017-04-08 13:21:04Z

152

…or a shorter try:

library(XML)
library(RCurl)
library(rlist)
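# Download the page source (SSL peer verification is switched off here)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
# Parse every table on the page into a list of data frames
tables <- readHTMLTable(theurl)
# Drop the NULL entries from the list
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Count the rows of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))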

The table picked is the one with the most rows on the page:

tables[[which.max(n.rows)]]
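The same "pick the longest table" idea can be tried offline. This is a minimal sketch, assuming a small inline HTML string (hypothetical, with figures borrowed from the sample output elsewhere in this thread) in place of the live page, and base R's Filter() in place of rlist::list.clean():

```r
library(XML)

# Hypothetical inline page: one short table and one longer table
html <- paste0(
  "<html><body>",
  "<table><tr><th>Key</th><th>Value</th></tr>",
  "<tr><td>a</td><td>b</td></tr></table>",
  "<table><tr><th>Opponent</th><th>Played</th></tr>",
  "<tr><td>Argentina</td><td>94</td></tr>",
  "<tr><td>Paraguay</td><td>72</td></tr>",
  "<tr><td>Uruguay</td><td>72</td></tr></table>",
  "</body></html>"
)

# Parse the string as HTML text rather than treating it as a file name
doc <- htmlParse(html, asText = TRUE)

# One data frame per <table>; drop any NULL entries with base R
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
tables <- Filter(Negate(is.null), tables)

# Row count per table, then keep the table with the most rows
n.rows <- vapply(tables, nrow, integer(1))
longest <- tables[[which.max(n.rows)]]
print(longest)
```

Picking by row count is only a heuristic: it works when the table you want happens to be the longest one on the page.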

edited Apr 8, 2017 at 13:21

Jim G.


answered Dec 4, 2009 at 20:14

user225056


1 Comment

Dave X Over a year ago

The readHTMLTable help also provides an example of reading a plain text table out of an HTML PRE element using htmlParse(), getNodeSet(), textConnection() and read.table()
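The approach this comment describes might look roughly like the following. It is a sketch under stated assumptions, not the example from the help page itself: the inline HTML and its contents are hypothetical, and only the four functions named in the comment (plus xmlValue()) are used.

```r
library(XML)

# Hypothetical page holding a whitespace-delimited table inside a PRE element
html <- "<html><body><pre>name score\nalice 10\nbob 7\n</pre></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Locate the first PRE element and extract its text content
pre <- getNodeSet(doc, "//pre")[[1]]
txt <- xmlValue(pre)

# Read the text as if it were a file
con <- textConnection(txt)
df <- read.table(con, header = TRUE, stringsAsFactors = FALSE)
close(con)

print(df)
```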

Richie Cotton · Accepted Answer · 2009-09-09 13:00:06Z

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

# In this case, the required table is the only one with class "wikitable sortable"
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable <- tables[which(tableclasses=="wikitable sortable")]$table

# Get column headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edited to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...
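The code comments above suggest that getNodeSet might be more efficient than walking the tree by hand. A minimal sketch of that idea follows, assuming a hypothetical inline page rather than the live Wikipedia markup: the table is selected by its class with XPath, and the resulting node is handed to readHTMLTable().

```r
library(XML)

# Hypothetical inline page; only the second table has the class we want
html <- paste0(
  "<html><body>",
  "<table class='infobox'><tr><th>Key</th><th>Value</th></tr>",
  "<tr><td>a</td><td>b</td></tr></table>",
  "<table class='wikitable sortable'>",
  "<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>",
  "<tr><td>Argentina</td><td>94</td><td>36</td></tr>",
  "<tr><td>Paraguay</td><td>72</td><td>44</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Select the table by class instead of hard-coding its position in the tree
nodes <- getNodeSet(doc, "//table[@class='wikitable sortable']")

# readHTMLTable() also accepts a single table node
thetable <- readHTMLTable(nodes[[1]], header = TRUE, stringsAsFactors = FALSE)
print(thetable)
```

Selecting by class avoids positional indexes such as children[[4]], which stop working as soon as the page layout changes.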

For anyone else who is fortunate enough to find this post, this script will likely not execute unless the user adds their "User-Agent" information, as described in this other helpful post: stackoverflow.com/questions/9056705/…

Dave2e · Accepted Answer · 2018-10-22 15:41:33Z

31

The rvest package, together with xml2, is another popular option for parsing HTML web pages.

library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
# tables[4] is a one-element node set, so html_table() returns a list holding one data frame
table1 <- html_table(tables[4], fill = TRUE)

The syntax is easier to use than the XML package's, and for most web pages rvest provides all of the options one needs.
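To try the rvest workflow without network access, here is a small sketch that assumes a hypothetical inline HTML string instead of the Wikipedia URL. It uses double brackets to pull out a single table node, so html_table() returns one table rather than a list.

```r
library(rvest)

# Hypothetical inline page with two tables
html <- paste0(
  "<html><body>",
  "<table><tr><th>Key</th><th>Value</th></tr>",
  "<tr><td>a</td><td>b</td></tr></table>",
  "<table><tr><th>Opponent</th><th>Played</th></tr>",
  "<tr><td>Argentina</td><td>94</td></tr>",
  "<tr><td>Paraguay</td><td>72</td></tr></table>",
  "</body></html>"
)

# read_html() accepts literal HTML as well as a path or URL
page <- read_html(html)
tables <- html_nodes(page, "table")

# Extract the second table as a single data frame
matches <- html_table(tables[[2]])
print(matches)
```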

edited Oct 22, 2018 at 15:41

answered May 13, 2016 at 0:55

Dave2e


1 Comment

scs Over a year ago

The read_html gives me the error "'file:///Users/grieb/Auswertungen/tetyana-snp-2016/data/snp-nexus/15/SNP%20Annotation%20Tool.html' does not exist in current working directory ('/Users/grieb/Auswertungen/tetyana-snp-2016/code')."

learnr · Accepted Answer · 2009-09-09 22:01:45Z

28

Another option, using XPath.

library(RCurl)
library(XML)

theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead

This produces the following result:

> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%
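The reshaping step can be checked on a small example. The sketch below assumes a hypothetical inline table with three columns. It takes the column count from the number of header cells instead of hard-coding 8, and it uses `//tr` so that the path still matches if the rows are wrapped in a tbody element.

```r
library(XML)

# Hypothetical inline page with a three-column table
html <- paste0(
  "<html><body><table class='wikitable sortable'>",
  "<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>",
  "<tr><td>Argentina</td><td>94</td><td>36</td></tr>",
  "<tr><td>Paraguay</td><td>72</td><td>44</td></tr>",
  "</table></body></html>"
)

pagetree <- htmlParse(html, asText = TRUE)

# Flat character vectors of header cells and body cells, in document order
tablehead <- xpathSApply(pagetree, "//table[@class='wikitable sortable']//tr/th", xmlValue)
results <- xpathSApply(pagetree, "//table[@class='wikitable sortable']//tr/td", xmlValue)

# Fill the matrix row by row; the number of columns equals the number of headers
content <- as.data.frame(
  matrix(results, ncol = length(tablehead), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(content) <- tablehead
print(content)
```

This only works when every row has the same number of cells. Merged cells (rowspan or colspan) would shift values into the wrong columns.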

edited Sep 9, 2009 at 22:01

answered Sep 9, 2009 at 18:43

learnr


3 Comments

Richie Cotton Over a year ago

Excellent call on using xpath. Minor point: you can slightly simplify the path argument by changing //*/ to //, e.g. "//table[@class='wikitable sortable']/tr/th"

pssguy Over a year ago

I get an error: "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice." Is there a way around this so that I can use this method?

learnr Over a year ago

options(RCurlOptions = list(useragent = "zzzz")). See also omegahat.org/RCurl/FAQ.html section "Runtime" for other alternatives and discussion.
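As an alternative to the global option, the user agent can probably also be passed per request. This is a sketch only: the agent string and the fetch_page() helper are hypothetical names introduced for illustration, and the actual download is left commented out because it needs network access.

```r
library(RCurl)

# Hypothetical agent string; replace with your own tool name and contact details
agent <- "my-table-scraper/0.1 (contact: me@example.org)"

# Hypothetical helper: curl options such as useragent are passed through getURL()
fetch_page <- function(url, useragent = agent) {
  getURL(url, useragent = useragent)
}

# Example call (requires network access):
# webpage <- fetch_page("https://en.wikipedia.org/wiki/Brazil_national_football_team")
```

Setting the option per call keeps the change local to the scraper, whereas options(RCurlOptions = ...) affects every later RCurl request in the session.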
