3

I am trying to scrape/extract data from the single html table on: http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages. I initially tried using the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus/species/infraspecies/author etc).

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify a unique XPATH to each table element that I want to extract (not necessarily the shortest):

For genus names : //[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]// [contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence levels (image): //[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img For sources: //[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a

I now want to extract the information into a dataframe/table.

I tried using the xpathSApply function of the XML package to extract some of this data:

e.g. for infraspecies ranks

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

However, this method is problematic because of gaps in the data (e.g. only some rows of the table have an infraspecies rank, so all I have returned is a list of the three ranks in the table, with no gaps). The data output is also of a class that I have had trouble attaching to a dataframe.

Does anyone know a better way to extract information from this table into a dataframe?

Any help would be much appreciated!

Tom

1
  • Quick suggestion: read in the full HTML as a character string, then simply apply regular expressions (in my experience, HTML is very susceptible to that). First isolate the part with the table, and then do substructure... Commented Jun 21, 2011 at 15:10

2 Answers 2

5

Here is another solution, which splits each species name into its component parts

library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = T)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# function to extract desired fields from a given node    
fields = list('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output

 genus      species     infraspe                      authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Ã\u0081.Löve          H
2   Hordeum     chilense     chilense                                          L
3   Hordeum  cylindricum                                       Steud.          H
4   Hordeum depauperatum                                       Steud.          H
5   Hordeum     pratense brongniartii                       Macloskie          L
6   Hordeum    secalinum     chilense                   Ã\u0089.Desv.          L
Sign up to request clarification or add additional context in comments.

2 Comments

When I try to run this code I get the following error: Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"
Make sure to update your versions of the XML and plyr packages. I checked the code and it still works for me.
2

The following code parses your table into a matrix.

Caveats:

  • The confidence level column is blank, since this is not text but an image. If this is important, you should be able to retrieve the image location, and parse that.
  • There are some encoding issues (UTF-8 character get converted into ASCII on my machine). I don't yet know how to fix this.

The code:

library(XML)
library(RCurl)

baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))

The results:

     [,1]                                                [,2]      [,3] [,4]  
[1,] "Critesion chilense (Roem. & Schult.) Ã.Löve" "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "                   "Synonym" ""   "TRO" 
[3,] "Hordeum cylindricum Steud. [Illegitimate]"         "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                       "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie"      "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense Ã.Desv."        "Synonym" ""   "WCSP"

1 Comment

Hi, thanks very much for the suggestion! Ideally I would like to split the name into each of its component parts, as in Ramnath's example below, but its good to see another way to go about it!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.