
First, to be clear: my goal is to scrape data from ~100 URLs monthly using the code below. I need the data from each URL exported to the same XLSX file, but into different sheets with predetermined names. Example from the code below: workbook name = "data.xlsx", sheet name = "FEUR". Also: all of the links have exactly the same layout and XPaths, so simply inserting a new link works perfectly.

The only working solution I have found so far is copy-pasting the code from the ####### line down, changing the URL in driver.get() and the sheet_name="XX" in df.to_excel() each time.

Instead, I am looking for a more efficient way to add links and make the code less heavy. Is this possible with Selenium?

See the code below:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
from openpyxl import load_workbook

opts = Options()
opts.add_argument("--headless")

chrome_driver = os.path.join(os.getcwd(), "chromedriver")

driver = webdriver.Chrome(service=Service(chrome_driver), options=opts)
driver.implicitly_wait(10)

############
#FEUR
driver.get("https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2F&tab=3")
driver.switch_to.frame(1)
driver.find_element(By.XPATH, "//button[contains(@class,'show-table')]//span").click()

# collect the header cells, then one row per <tr> in the table body
table = driver.find_elements(By.XPATH, "//div[contains(@class,'sal-mip-factor-profile__value-table')]/table//tr/th")
header = [cell.text for cell in table]

tablebody = driver.find_elements(By.XPATH, "//div[contains(@class,'sal-mip-factor-profile__value-table')]/table//tbody/tr")
data = [header]
for tr in tablebody:
    row = [td.text for td in tr.find_elements(By.TAG_NAME, "td")]
    data.append(row)
df = pd.DataFrame(data)

path = r'/Users/karlemilthulstrup/Downloads/data.xlsx'
# mode='a' appends new sheets to the existing workbook
with pd.ExcelWriter(path, engine='openpyxl', mode='a') as writer:
    df.to_excel(writer, sheet_name="FEUR")
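Since the repeated block differs only in the URL and the sheet name, it can be driven by a list of (URL, sheet-name) pairs instead of being copy-pasted. A minimal sketch of that loop, assuming a `scrape_table(url)` helper that wraps the Selenium/XPath logic above (the helper name and the target list are placeholders, not part of the original code):

```python
import pandas as pd

# (url, sheet_name) pairs -- extend with the remaining ~100 funds
TARGETS = [
    ("https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2F&tab=3", "FEUR"),
]

def export_tables(targets, scrape_table, path):
    """Scrape each URL once and write the result to its own sheet.

    scrape_table: callable(url) -> list of rows (header row first),
    i.e. the Selenium logic above wrapped in a function.
    """
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for url, sheet_name in targets:
            df = pd.DataFrame(scrape_table(url))
            df.to_excel(writer, sheet_name=sheet_name, index=False)
```

With this structure, adding a link is a one-line change to `TARGETS` rather than another copy of the whole block.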
  • I would try to look and see if there's an API to get the data. Can you provide a few of the URLs you would be scraping? Commented Aug 31, 2021 at 12:01
  • Hi! I am not really familiar with using APIs, but I have tried to look into it. Here are a few examples: morningstar.dk/dk/funds/snapshot/… morningstar.dk/dk/funds/snapshot/… morningstar.dk/dk/funds/snapshot/… morningstar.dk/dk/funds/snapshot/… Commented Aug 31, 2021 at 12:07
  • Nice. It looks like there are some APIs. Now it's just a matter of seeing if they contain the data you want/need. What exactly do you want to extract from each site? Commented Aug 31, 2021 at 12:11
  • In the URL I want to go to "Faktorprofil" and click the top-right button to change the visualisation to a table, then scrape that table's data. I can provide a screenshot if that helps? Commented Aug 31, 2021 at 12:18
  • Yes, provide an image. I don't see where Faktorprofil is. Commented Aug 31, 2021 at 12:22

1 Answer


As long as you have the IDs (which you can extract from the links if that's what you have), you can feed those into the API. You may need to tweak this a bit to fit your needs, but see how it goes.
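Since each snapshot URL carries the fund ID in its `id=` query parameter, the list of IDs can be derived straight from the links — a small helper sketch (the function name is my own, not from the original answer):

```python
import re

def id_from_url(url):
    """Extract the Morningstar fund id from a snapshot URL's id= parameter."""
    match = re.search(r'[?&]id=([^&#]+)', url)
    return match.group(1) if match else None

urls = ["https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2F&tab=3"]
ids = [id_from_url(u) for u in urls]  # -> ['F00000ZG2F']
```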

import requests
import re
import pandas as pd

auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'

# Create a Pandas Excel writer using openpyxl as the engine.
writer = pd.ExcelWriter('C:/Users/karlemilthulstrup/Downloads/data.xlsx', engine='openpyxl')

ids = ['F00000ZG2F', 'F0000025UI', 'F00000Z1MD', 'F00000Z4AE','F00000ZG2F']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
api_params = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}

for api_id in ids:
    # separate params for the auth page, so api_params is not overwritten
    auth_params = {
        'Site': 'dk',
        'FC': api_id,
        'IT': 'FO',
        'LANG': 'da-DK'}

    response = requests.get(auth, params=auth_params)

    # the page embeds a maas token; pull it out and use it as the bearer token
    search = re.search(r'(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)

    headers.update({'Authorization': bearer})

    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=api_params).json()
    
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        
        historicRange = v.pop('historicRange')
        row.update(v)
        
        for each in historicRange:
            row.update(each)
            
            rows.append(row.copy())
        
    
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' %sheetName)

writer.close()

3 Comments

Thank you so much, it works great! The sheets get quite random names, but this shouldn't be a problem; I just have to create a lookup to link each ID with a more recognisable name. Nevertheless, thank you so much for your help!
No problem. Yes, if you create a lookup that gives you a better description, that should work fine. I just don't know what would make sense to you. You could also debug and see if there is something unique in the JSON that would make more sense.
I created a list now that works great in Excel - thank you anyway though! However, I was wondering if you could help me extract the data from another table? I am guessing the code would be approximately the same, but I can't see how to make it work. It is the first table of this link: morningstar.dk/dk/funds/snapshot/…
