For university research I try to scrape an FDA table (robots.txt allows to scrape this content)
The table contains 19 rows and 2 columns: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181
The format I try to extract is:
col1 col2 url_of_col2
<chr> <chr> <chr>
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~
What I achieved:
I can easly extract the items of the first column:
#library
library(tidyverse)
library(xml2)
library(rvest)
#load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")
# select table of interest
html %>%
html_nodes("table") -> tables
tables[[9]] -> table
# extract col 1 items
table %>%
html_nodes("th") %>%
html_text() %>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "Device Classification Name" "510(k) Number"
#> [3] "Device Name" "Applicant"
#> [5] "Applicant Contact" "Correspondent"
#> [7] "Correspondent Contact" "Regulation Number"
#> [9] "Classification Product Code" "Date Received"
#> [11] "Decision Date" "Decision"
#> [13] "Regulation Medical Specialty" "510k Review Panel"
#> [15] "summary" "Type"
#> [17] "Clinical Trials" "Reviewed by Third Party"
#> [19] "Combination Product"
Created on 2021-02-27 by the reprex package (v1.0.0)
Where I get stuck
- Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
html_nodes("td") %>%
html_text()%>%
gsub("\n|\t|\r","",.) %>%
trimws()
#> [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"
#> [2] "K203181"
#> [3] "Nerivio, FGD000075-4.7"
#> [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#> [5] "Theranica Bioelectronics ltd"
#> [6] "4 Ha-Omanutst. Poleg Industrial Park"
#> [7] "netanya, IL4250574"
#> [8] "alon ironi"
#> [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"
#> [10] "Hogan Lovells US LLP"
#> [11] "1735 Market Street"
#> [12] "Suite 2300"
#> [13] "philadelphia, PA 19103"
#> [14] "janice m. hogan"
#> [15] "882.5899"
#> [16] "QGT "
#> [17] "QGT "
#> [18] "10/26/2020"
#> [19] "01/22/2021"
#> [20] "substantially equivalent (SESE)"
#> [21] "Neurology"
#> [22] "Neurology"
#> [23] "summary"
#> [24] "Traditional"
#> [25] "NCT04089761"
#> [26] "No"
#> [27] "No"
Created on 2021-02-27 by the reprex package (v1.0.0)
- Moreover, I could not find a way to extract the urls of col2
I found a good manual to read html tables with cells spanning on multiple rows. However, I think this approach does not work for nested dataframes.
There is similar question regarding a nested table without links (How to scrape older html with nested tables in R?) which has not been answered yet. A comment suggested this question, unfortunately I could not apply it to my html table.
There is the unpivotr package that aims to read nested html tables, however, I could not solve my problem with that package.