Parsing HTML tables using the XML / RCurl R packages, without using the readHTMLTable function

Question

I am trying to extract data from the single HTML table on http://www.theplantlist.org/tpl/record/kew-419248 and from a number of very similar pages. I initially tried the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus, species, infraspecies, author, etc.).

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify a unique XPATH to each table element that I want to extract (not necessarily the shortest):

For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img

For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a

I now want to extract the information into a dataframe/table.

I tried using the xpathSApply function of the XML package to extract some of this data:

e.g. for infraspecies ranks

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

However, this method is problematic because of gaps in the data (e.g. only some rows of the table have an infraspecies rank, so all I have returned is a list of the three ranks in the table, with no gaps). The data output is also of a class that I have had trouble attaching to a dataframe.
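A minimal sketch of the two problems described above, run against a small inline HTML fragment rather than the real page. The fragment and its class names are assumptions made for illustration. Without a function argument, xpathSApply returns node objects; passing xmlValue gives a plain character vector instead. The sketch also shows why a document-wide query loses the gaps: it returns one value per match, not one per row.

```r
library(XML)

# Hypothetical stand-in for the real page: three rows, only two carry a rank
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i> <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">pratense</i> <span class="infraspr">var.</span> <i class="infraspe">brongniartii</i></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)

# Passing xmlValue returns the text of each match as a character vector
ranks <- xpathSApply(doc, "//*[@class='infraspr']", xmlValue)

# The document-wide query yields fewer values than there are rows,
# so the values cannot be lined up with the rows afterwards
rows <- getNodeSet(doc, "//table/tbody/tr")
length(ranks)
length(rows)
```

Querying each row separately, as in the answers below, keeps the values aligned with the rows they came from.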

Does anyone know a better way to extract information from this table into a dataframe?

Any help would be much appreciated!

Tom

Quick suggestion: read in the full HTML as a character string, then simply apply regular expressions (in my experience, HTML is very susceptible to that). First isolate the part with the table, and then do substructure...
– Nick Sabbe, Commented Jun 21, 2011 at 15:10

Ramnath · Accepted Answer · 2011-06-21 16:59:50Z

5

Here is another solution, which splits each species name into its component parts:

library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# function to extract desired fields from a given node
fields = c('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    # one relative XPath query per field; a field absent from the row has length 0
    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    # start from blanks so every row has the same number of fields
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    # the confidence level is the alt text of the image in the row
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output:

      genus      species     infraspe                      authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Ã\u0081.LÃ¶ve          H
2   Hordeum     chilense     chilense                                          L
3   Hordeum  cylindricum                                       Steud.          H
4   Hordeum depauperatum                                       Steud.          H
5   Hordeum     pratense brongniartii                       Macloskie          L
6   Hordeum    secalinum     chilense                   Ã\u0089.Desv.          L
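The same row-by-row idea can be sketched without plyr. This is a variant, not the answer's code: it runs on a small inline fragment (hypothetical markup and class names, assumed to resemble the page), adds the infraspr rank as a field, and fills missing fields with NA rather than a blank. The helper names first_value and read_row are made up for this example.

```r
library(XML)

# Hypothetical fragment; the real page's markup may differ
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i> <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td><td><img alt="L" src="l.png"/></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i> <span class="authorship">Steud.</span></td><td><img alt="H" src="h.png"/></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">pratense</i> <span class="infraspr">var.</span> <i class="infraspe">brongniartii</i> <span class="authorship">Macloskie</span></td><td><img alt="L" src="l.png"/></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//table[@class='names synonyms']/tbody/tr")

fields <- c("genus", "species", "infraspr", "infraspe", "authorship")

# Text of the first element with class `cls` inside `row`, or NA if absent
first_value <- function(row, cls) {
  hits <- xpathSApply(row, sprintf(".//*[@class='%s']", cls), xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One named character vector per row: the fields plus the image alt text
read_row <- function(row) {
  vals <- vapply(fields, function(f) first_value(row, f), character(1))
  conf <- xpathSApply(row, ".//img", xmlGetAttr, "alt")
  c(vals, confidence = if (length(conf) == 0) NA_character_ else conf[[1]])
}

# Stack the per-row vectors into a data frame, one row per table row
df <- as.data.frame(do.call(rbind, lapply(rows, read_row)),
                    stringsAsFactors = FALSE)
```

Using NA instead of a blank string lets is.na() pick out the rows that lack an infraspecies rank or an author.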

answered Jun 21, 2011 at 16:59

Ramnath


2 Comments

rrs Over a year ago

When I try to run this code I get the following error: Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"

Ramnath Over a year ago

Make sure to update your versions of the XML and plyr packages. I checked the code and it still works for me.

Andrie · Accepted Answer · 2011-06-21 15:13:59Z

2

The following code parses your table into a matrix.

Caveats:

The confidence level column is blank, since this is not text but an image. If this is important, you should be able to retrieve the image location, and parse that.
There are some encoding issues (UTF-8 characters get converted into ASCII on my machine). I don't yet know how to fix this.

The code:

library(XML)
library(RCurl)

baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))

The results:

     [,1]                                                [,2]      [,3] [,4]  
[1,] "Critesion chilense (Roem. & Schult.) Ã.LÃ¶ve" "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "                   "Synonym" ""   "TRO" 
[3,] "Hordeum cylindricum Steud. [Illegitimate]"         "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                       "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie"      "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense Ã.Desv."        "Synonym" ""   "WCSP"
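Both caveats can probably be addressed along the following lines. This is a sketch under assumptions, not a tested fix for the real page. The inline fragment is hypothetical. Declaring the encoding through the encoding argument of htmlParse assumes the page is served as UTF-8. The confidence level is assumed to be in the alt attribute of the image, as in the other answer. The helper name read_row is made up for this example.

```r
library(XML)

# Hypothetical UTF-8 fragment standing in for the downloaded page text
txt <- '<table><tbody>
<tr><td>Critesion chilense \u00c1.L\u00f6ve</td><td>Synonym</td><td><img alt="H" src="h.png"/></td><td>WCSP</td></tr>
<tr><td>Hordeum chilense var. chilense</td><td>Synonym</td><td><img alt="L" src="l.png"/></td><td>TRO</td></tr>
</tbody></table>'

# Declare the encoding explicitly instead of letting the parser guess
doc <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")
rows <- getNodeSet(doc, "//table//tbody//tr")

# For one row: text of the cells without an image, then the image alt text
read_row <- function(r) {
  text <- xpathSApply(r, "./td[not(img)]", xmlValue)
  alt <- xpathSApply(r, "./td/img", xmlGetAttr, "alt")
  c(text, if (length(alt) == 0) NA_character_ else alt[[1]])
}

# Character matrix with columns: name, status, source, confidence
res <- t(sapply(rows, read_row))
```

Note that the column order differs from the matrix above: the confidence value comes last instead of third.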

answered Jun 21, 2011 at 15:13

Andrie


1 Comment

tom Over a year ago

Hi, thanks very much for the suggestion! Ideally I would like to split the name into each of its component parts, as in Ramnath's example, but it's good to see another way to go about it!
