
For university research I am trying to scrape an FDA table (robots.txt allows scraping this content).

The table contains 19 rows and 2 columns: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181

The format I am trying to extract is:

col1                 col2                                                    url_of_col2                                                                   
  <chr>                <chr>                                                   <chr>                                                                         
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~

What I achieved:

I can easily extract the items of the first column:

#library
library(tidyverse)
library(xml2)
library(rvest)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
html %>% 
  html_nodes("table") -> tables
tables[[9]] -> table

# extract col 1 items
table %>%
  html_nodes("th") %>% 
  html_text() %>%
  gsub("\n|\t|\r","",.) %>% 
  trimws()
#>  [1] "Device Classification Name"   "510(k) Number"               
#>  [3] "Device Name"                  "Applicant"                   
#>  [5] "Applicant Contact"            "Correspondent"               
#>  [7] "Correspondent Contact"        "Regulation Number"           
#>  [9] "Classification Product Code"  "Date Received"               
#> [11] "Decision Date"                "Decision"                    
#> [13] "Regulation Medical Specialty" "510k Review Panel"           
#> [15] "summary"                      "Type"                        
#> [17] "Clinical Trials"              "Reviewed by Third Party"     
#> [19] "Combination Product"

Created on 2021-02-27 by the reprex package (v1.0.0)

Where I get stuck

  1. Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
  html_nodes("td") %>% 
  html_text()%>%
  gsub("\n|\t|\r","",.) %>% 
  trimws()
#>  [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"       
#>  [2] "K203181"                                                                           
#>  [3] "Nerivio, FGD000075-4.7"                                                            
#>  [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#>  [5] "Theranica Bioelectronics ltd"                                                      
#>  [6] "4 Ha-Omanutst. Poleg Industrial Park"                                              
#>  [7] "netanya, IL4250574"                                                                
#>  [8] "alon  ironi"                                                                       
#>  [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"            
#> [10] "Hogan Lovells US LLP"                                                              
#> [11] "1735 Market Street"                                                                
#> [12] "Suite 2300"                                                                        
#> [13] "philadelphia, PA 19103"                                                            
#> [14] "janice m. hogan"                                                                   
#> [15] "882.5899"                                                                          
#> [16] "QGT  "                                                                             
#> [17] "QGT  "                                                                             
#> [18] "10/26/2020"                                                                        
#> [19] "01/22/2021"                                                                        
#> [20] "substantially equivalent (SESE)"                                                   
#> [21] "Neurology"                                                                         
#> [22] "Neurology"                                                                         
#> [23] "summary"                                                                           
#> [24] "Traditional"                                                                       
#> [25] "NCT04089761"                                                                       
#> [26] "No"                                                                                
#> [27] "No"


  2. Moreover, I could not find a way to extract the URLs of col2.

I found a good tutorial on reading HTML tables with cells spanning multiple rows. However, I don't think that approach works for nested tables.

There is a similar question regarding a nested table without links (How to scrape older html with nested tables in R?), which has not been answered yet. A comment suggested this question; unfortunately, I could not apply it to my HTML table.

There is the unpivotr package, which aims to read nested HTML tables; however, I could not solve my problem with it either.

1 Answer

Yes, the tables nested within the rows of the parent table do make this more difficult. The key here is to find the 27 rows of the parent table and then parse each row individually.

library(rvest)
library(stringr)
library(dplyr)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
tables <- html %>%  html_nodes("table") 
table <- tables[[9]] 


#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column (html_node, singular, returns one match per row, so alignment is preserved)
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
#find the right column (str_squish removes whitespace at the ends and collapses runs in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish()
#get the first link (if any) in each row; rows without a link yield NA
links <- trows %>% html_node("td a") %>% html_attr("href")

answer <- data.frame(leftside, rightside, links)

One will need to use paste0("https://www.accessdata.fda.gov/", answer$links) on some of the links to obtain the full web address.
The final data frame does contain several NA cells; these can be removed, and the table can be cleaned up further depending on the final requirements. See tidyr::fill() as a good starting point.
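As an alternative to pasting the prefix by hand, xml2::url_absolute() resolves relative hrefs against the page URL and leaves already-absolute links untouched. A minimal sketch (the example hrefs are made up for illustration, not taken from the actual page):

```r
library(xml2)

# Hypothetical hrefs as they might appear in answer$links:
links <- c("classification.cfm?start_search=1&productcode=QGT",
           "https://clinicaltrials.gov/show/NCT04089761")

# Resolve relative links against the scraped page's URL;
# absolute links pass through unchanged.
url_absolute(links, "https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm")
```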

Update
To reduce the answer down to the desired 19 original rows:

library(tidyr)
#replace NA with blanks in the link column
answer$links <- replace_na(answer$links, "")
#fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")

#Create the final results
finalanswer <- answer %>% group_by(leftside) %>% 
                summarize(info=paste(rightside, collapse = " "), link=first(links))
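To make the fill-then-group step concrete, here is a toy sketch on made-up rows mimicking the multi-line Applicant field from the output above (the column values are illustrative only):

```r
library(dplyr)
library(tidyr)

# Three scraped rows belonging to one logical field: only the first
# row carries the left-hand label, the rest are NA.
toy <- data.frame(
  leftside  = c("Applicant", NA, NA),
  rightside = c("Theranica Bioelectronics ltd",
                "4 Ha-Omanut st. Poleg Industrial Park",
                "netanya, IL 4250574"),
  links     = c(NA, NA, NA)
)

toy$links <- replace_na(toy$links, "")
toy <- fill(toy, leftside, .direction = "down")

# Collapses to a single "Applicant" row whose info column joins the
# three address fragments with spaces.
toy %>% group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))
```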

1 Comment

This is marvellous, thank you. Just for my understanding, with tidyr::fill() I can remove the NAs of the leftside and merge the corresponding values of the rightside so that I get a data.frame with 19 rows, right?
