Scraping table using html_table in R

Question

I want to scrape the Sector Weightings Table from the following link:

http://portfolios.morningstar.com/fund/summary?t=SPY&region=usa&culture=en-US&ownerCountry=USA

The table I want is the sixth table in the page's source code. I have the following R script:

 library(rvest)
 turl = 'http://portfolios.morningstar.com/fund/summary?t=SPY'
 turlr = read_html(turl) 
 df6<-html_table(html_nodes(turlr, 'table')[[6]], fill = TRUE)

However, when I run the last line of the script, I get the following error message:

Error in out[j + k, ] : subscript out of bounds

You should see How to create a Minimal, Complete, and Verifiable example.
– user10089632, Commented Nov 26, 2017 at 23:44
To be precise, you didn't include the important code that produced this error.
– user10089632, Commented Nov 26, 2017 at 23:47
There are embedded charts and groupings in your target table. You will need to alter the returned node before it will be accepted by html_table. See this question for some guidance.
– Kevin Arseneau, Commented Nov 27, 2017 at 0:10
There are nigh countless R + scraping + morningstar posts on SO. Which ones did not have info that could have helped you? I'm constantly mystified about this, since it takes more energy to create a question than to do an actual search.
– hrbrmstr, Commented Nov 27, 2017 at 1:25

Prem · Accepted Answer · 2017-11-27 07:17:43Z

The required table is laid out in a nonstandard way, so rvest is not able to convert it into a proper table. With the XML package, however, you can do it quite easily.

library(XML)
library(dplyr)

# read the sixth table on the page
turl <- 'http://portfolios.morningstar.com/fund/summary?t=SPY'
temp_table <- readHTMLTable(turl)[[6]]

# keep the four data columns, drop rows containing NA,
# then set the column names and renumber the rows
final_table <- temp_table %>%
  select(V2, V3, V4, V5) %>%
  na.omit() %>%
  `colnames<-` (c("","% Stocks","Benchmark","Category Avg")) %>%
  `rownames<-` (seq_len(nrow(.)))
final_table

Output is:

                          % Stocks Benchmark Category Avg
1                Cyclical                                
2         Basic Materials     2.79      3.16         3.22
3       Consumer Cyclical    11.06     11.42        11.15
4      Financial Services    16.39     16.50        17.22
5             Real Estate     2.24      3.18         2.00
6               Sensitive                                
7  Communication Services     3.56      3.37         3.50
8                  Energy     5.83      5.79         5.79
9             Industrials    10.37     10.89        11.70
10             Technology    22.16     21.41        19.72
11              Defensive                                
12     Consumer Defensive     8.20      7.60         8.56
13             Healthcare    14.24     13.57        14.57
14              Utilities     3.15      3.11         2.59
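
Note that the group rows (Cyclical, Sensitive, Defensive) come through as ordinary rows with empty values, and the numbers are still text. If you need numeric columns, one possible follow-up is sketched below in base R. It is only an illustration: `sectors` is a small hand-typed stand-in for part of the output above, and the column names `sector`, `stocks`, `benchmark` and `group` are made up for this example, not something the page or the XML package provides.

```r
# Hypothetical stand-in for a few rows of the output above
# (values copied by hand; the column names are assumptions for this sketch).
sectors <- data.frame(
  sector    = c("Cyclical", "Basic Materials", "Consumer Cyclical",
                "Sensitive", "Energy"),
  stocks    = c("", "2.79", "11.06", "", "5.83"),
  benchmark = c("", "3.16", "11.42", "", "5.79"),
  stringsAsFactors = FALSE
)

# Group header rows are the ones with no value in the numeric columns.
is_header <- sectors$stocks == ""

# Carry each header label down to the rows beneath it.
group <- sectors$sector[is_header][cumsum(is_header)]

# Drop the header rows and convert the text values to numbers.
tidy <- data.frame(
  group     = group[!is_header],
  sector    = sectors$sector[!is_header],
  stocks    = as.numeric(sectors$stocks[!is_header]),
  benchmark = as.numeric(sectors$benchmark[!is_header]),
  stringsAsFactors = FALSE
)
tidy
```

If the real columns are factors rather than character vectors, wrap them in as.character() before comparing or converting.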

Hope it helps!
