
For university research I am trying to scrape an FDA table (robots.txt allows scraping this content).

The table contains 19 rows and 2 columns: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181

The format I am trying to extract is:

col1                 col2                                                    url_of_col2                                                                   
  <chr>                <chr>                                                   <chr>                                                                         
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~

What I achieved:

I can easily extract the items of the first column:

#library
library(tidyverse)
library(xml2)
library(rvest)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
html %>% 
  html_nodes("table") -> tables
tables[[9]] -> table

# extract col 1 items
table %>%
  html_nodes("th") %>% 
  html_text() %>%
  gsub("\n|\t|\r","",.) %>% 
  trimws()
#>  [1] "Device Classification Name"   "510(k) Number"               
#>  [3] "Device Name"                  "Applicant"                   
#>  [5] "Applicant Contact"            "Correspondent"               
#>  [7] "Correspondent Contact"        "Regulation Number"           
#>  [9] "Classification Product Code"  "Date Received"               
#> [11] "Decision Date"                "Decision"                    
#> [13] "Regulation Medical Specialty" "510k Review Panel"           
#> [15] "summary"                      "Type"                        
#> [17] "Clinical Trials"              "Reviewed by Third Party"     
#> [19] "Combination Product"

Created on 2021-02-27 by the reprex package (v1.0.0)

Where I get stuck

  1. Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
  html_nodes("td") %>% 
  html_text()%>%
  gsub("\n|\t|\r","",.) %>% 
  trimws()
#>  [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"       
#>  [2] "K203181"                                                                           
#>  [3] "Nerivio, FGD000075-4.7"                                                            
#>  [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#>  [5] "Theranica Bioelectronics ltd"                                                      
#>  [6] "4 Ha-Omanutst. Poleg Industrial Park"                                              
#>  [7] "netanya, IL4250574"                                                                
#>  [8] "alon  ironi"                                                                       
#>  [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"            
#> [10] "Hogan Lovells US LLP"                                                              
#> [11] "1735 Market Street"                                                                
#> [12] "Suite 2300"                                                                        
#> [13] "philadelphia, PA 19103"                                                            
#> [14] "janice m. hogan"                                                                   
#> [15] "882.5899"                                                                          
#> [16] "QGT  "                                                                             
#> [17] "QGT  "                                                                             
#> [18] "10/26/2020"                                                                        
#> [19] "01/22/2021"                                                                        
#> [20] "substantially equivalent (SESE)"                                                   
#> [21] "Neurology"                                                                         
#> [22] "Neurology"                                                                         
#> [23] "summary"                                                                           
#> [24] "Traditional"                                                                       
#> [25] "NCT04089761"                                                                       
#> [26] "No"                                                                                
#> [27] "No"


  2. Moreover, I could not find a way to extract the URLs of col2.

I found a good tutorial on reading HTML tables with cells spanning multiple rows. However, I don't think that approach works for nested tables.

There is a similar question regarding a nested table without links (How to scrape older html with nested tables in R?), which has not been answered yet. A comment suggested this question; unfortunately, I could not apply it to my HTML table.

There is the unpivotr package, which aims to read nested HTML tables; however, I could not solve my problem with it either.

1 Answer

Yes, the tables nested within the rows of the parent table do make this more difficult. The key here is to find the 27 rows of the parent table and then parse each row individually.

library(rvest)
library(stringr)
library(dplyr)

#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
tables <- html %>%  html_nodes("table") 
table <- tables[[9]] 


#find all of the table's rows
trows <- table %>% html_nodes("tr")
#find the left column (html_node, singular, returns one match per row, so alignment is preserved)
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()
#find the right column (str_squish removes whitespace at the ends and collapses runs in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish()
#get the first link (if any) in each row; rows without a link yield NA
links <- trows %>% html_node("td a") %>% html_attr("href")

answer <- data.frame(leftside, rightside, links)

One will need to use paste0("https://www.accessdata.fda.gov/", answer$links) on some of the links to obtain the full web address.
The final data frame does contain several NA cells; these can be removed, and the table can be cleaned up further depending on the final requirements. See tidyr::fill() as a good starting point.
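As an alternative to pasting the prefix by hand, xml2::url_absolute() resolves relative hrefs against the page URL and leaves already-absolute links untouched. A minimal sketch (the example hrefs are made up for illustration, not taken from the actual page):

```r
library(xml2)

# Hypothetical hrefs as they might appear in answer$links:
links <- c("classification.cfm?start_search=1&productcode=QGT",
           "https://clinicaltrials.gov/show/NCT04089761")

# Resolve relative links against the scraped page's URL;
# absolute links pass through unchanged.
url_absolute(links, "https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm")
```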

Update
To reduce the answer down to the desired 19 original rows:

library(tidyr)
#replace NA with blanks in the link column
answer$links <- replace_na(answer$links, "")
#fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")

#Create the final results
finalanswer <- answer %>% group_by(leftside) %>% 
                summarize(info=paste(rightside, collapse = " "), link=first(links))
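To make the fill-then-group step concrete, here is a toy sketch on made-up rows mimicking the multi-line Applicant field from the output above (the column values are illustrative only):

```r
library(dplyr)
library(tidyr)

# Three scraped rows belonging to one logical field: only the first
# row carries the left-hand label, the rest are NA.
toy <- data.frame(
  leftside  = c("Applicant", NA, NA),
  rightside = c("Theranica Bioelectronics ltd",
                "4 Ha-Omanut st. Poleg Industrial Park",
                "netanya, IL 4250574"),
  links     = c(NA, NA, NA)
)

toy$links <- replace_na(toy$links, "")
toy <- fill(toy, leftside, .direction = "down")

# Collapses to a single "Applicant" row whose info column joins the
# three address fragments with spaces.
toy %>% group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))
```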

1 Comment

This is marvellous, thank you. Just for my understanding, with tidyr::fill() I can remove the NAs of the leftside and merge the corresponding values of the rightside so that I get a data.frame with 19 rows, right?
