I am trying to extract data from the single HTML table on http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages. I initially tried using the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus, species, infraspecies, author, etc.).
library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")
I used SelectorGadget to identify an XPath for each table element that I want to extract (not necessarily the shortest):
For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]
For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]
For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]
For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]
For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img
For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a
I now want to extract the information into a dataframe/table.
I tried using the xpathSApply function of the XML package to extract some of this data, e.g. for infraspecies ranks:
library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)
However, this method is problematic because of gaps in the data: only some rows of the table have an infraspecies rank, so all that is returned is a list of the three ranks in the table, with nothing to show which rows they belong to. The output is also of a class that I have had trouble attaching to a data frame.
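To make the gap problem concrete, here is a minimal sketch on a made-up HTML fragment. The markup, the `tr` row structure and the helper name `first_or_na` are my assumptions for illustration, not the real page, and the XPaths use a simplified `contains(@class, ...)` test. It also shows the kind of per-row lookup I think is needed, where a missing element becomes NA instead of being dropped:

```r
library(XML)

# Made-up fragment for illustration only; the real page's markup may differ.
html <- '
<table>
  <tr class="Synonym"><td><i class="genus">Aa</i> <i class="species">bb</i>
    <span class="infraspr">var.</span> <i class="infraspe">cc</i></td></tr>
  <tr class="Synonym"><td><i class="genus">Aa</i> <i class="species">dd</i></td></tr>
</table>'
doc <- htmlParse(html, asText = TRUE)

# Document-wide query: the row without a rank is silently skipped,
# so the result has one element even though there are two rows.
all_ranks <- xpathSApply(doc, '//*[contains(@class, "infraspr")]', xmlValue)

# Hypothetical helper: first match relative to one row, or NA if none.
first_or_na <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# Per-row query: the leading "." makes each path relative to the row.
rows <- getNodeSet(doc, '//tr[contains(@class, "Synonym")]')
result <- data.frame(
  genus    = sapply(rows, first_or_na, path = './/*[contains(@class, "genus")]'),
  species  = sapply(rows, first_or_na, path = './/*[contains(@class, "species")]'),
  rank     = sapply(rows, first_or_na, path = './/*[contains(@class, "infraspr")]'),
  infraspe = sapply(rows, first_or_na, path = './/*[contains(@class, "infraspe")]'),
  stringsAsFactors = FALSE
)
```

On the toy fragment this gives one row per table row, with NA where there is no rank. I have not confirmed that the real table splits into rows this cleanly.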
Does anyone know a better way to extract information from this table into a dataframe?
Any help would be much appreciated!
Tom