
First, to be clear: my goal is to scrape data from ~100 URLs monthly using the code below. I need the data from each URL exported to the same XLSX file, but into different sheets with predetermined names. Example from the code below: workbook name = "data.xlsx", sheet name = "FEUR". Also: all of the links have exactly the same layout and XPaths, so simply inserting a new link works perfectly.

The only working solution I have found so far is copy-pasting the code from the ####### line down, changing the URL in driver.get() and the sheet_name="XX" in df.to_excel() each time.

Instead, I am looking for a more efficient way to add links and make the code less heavy. Is this possible with Selenium?

See the code below:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
from openpyxl import load_workbook

opts = Options()
opts.add_argument("--headless")

chrome_driver = os.path.join(os.getcwd(), "chromedriver")

driver = webdriver.Chrome(service=Service(chrome_driver), options=opts)
driver.implicitly_wait(10)

############
#FEUR
driver.get("https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2F&tab=3")
driver.switch_to.frame(1)
driver.find_element(By.XPATH, "//button[contains(@class,'show-table')]//span").click()

# collect the header cells, then one row per <tr> in the table body
table = driver.find_elements(By.XPATH, "//div[contains(@class,'sal-mip-factor-profile__value-table')]/table//tr/th")
header = [cell.text for cell in table]

tablebody = driver.find_elements(By.XPATH, "//div[contains(@class,'sal-mip-factor-profile__value-table')]/table//tbody/tr")
data = [header]
for tr in tablebody:
    row = [td.text for td in tr.find_elements(By.TAG_NAME, "td")]
    data.append(row)
df = pd.DataFrame(data)

path = r'/Users/karlemilthulstrup/Downloads/data.xlsx'
# mode='a' appends new sheets to the existing workbook
with pd.ExcelWriter(path, engine='openpyxl', mode='a') as writer:
    df.to_excel(writer, sheet_name="FEUR")
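Since the repeated block differs only in the URL and the sheet name, it can be driven by a list of (URL, sheet-name) pairs instead of being copy-pasted. A minimal sketch of that loop, assuming a `scrape_table(url)` helper that wraps the Selenium/XPath logic above (the helper name and the target list are placeholders, not part of the original code):

```python
import pandas as pd

# (url, sheet_name) pairs -- extend with the remaining ~100 funds
TARGETS = [
    ("https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2F&tab=3", "FEUR"),
]

def export_tables(targets, scrape_table, path):
    """Scrape each URL once and write the result to its own sheet.

    scrape_table: callable(url) -> list of rows (header row first),
    i.e. the Selenium logic above wrapped in a function.
    """
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for url, sheet_name in targets:
            df = pd.DataFrame(scrape_table(url))
            df.to_excel(writer, sheet_name=sheet_name, index=False)
```

With this structure, adding a link is a one-line change to `TARGETS` rather than another copy of the whole block.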
  • I would try to look and see if there's an API to get the data. Can you provide a few of the URLs you would be scraping? Commented Aug 31, 2021 at 12:01
  • Hi! I am not really familiar with using APIs, but I have tried to look into it. Here are a few examples: morningstar.dk/dk/funds/snapshot/… morningstar.dk/dk/funds/snapshot/… morningstar.dk/dk/funds/snapshot/… morningstar.dk/dk/funds/snapshot/… Commented Aug 31, 2021 at 12:07
  • Nice. It looks like there are some APIs. Now it's just a matter of seeing if they contain the data you want/need. What exactly do you want to extract from each site? Commented Aug 31, 2021 at 12:11
  • In the URL I want to go to "Faktorprofil" and click the top-right button to change the visualisation to a table, then scrape that table's data. I can provide a screenshot if that helps? Commented Aug 31, 2021 at 12:18
  • Yes, provide an image. I don't see where Faktorprofil is. Commented Aug 31, 2021 at 12:22

1 Answer


As long as you have the IDs (which you can extract from the links if that's what you have), you can feed those into the API. You may need to tweak this a bit to fit your needs, but see how it goes.
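Since each snapshot URL carries the fund ID in its `id=` query parameter, the list of IDs can be derived straight from the links — a small helper sketch (the function name is my own, not from the original answer):

```python
import re

def id_from_url(url):
    """Extract the Morningstar fund id from a snapshot URL's id= parameter."""
    match = re.search(r'[?&]id=([^&#]+)', url)
    return match.group(1) if match else None

urls = ["https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000ZG2F&tab=3"]
ids = [id_from_url(u) for u in urls]  # -> ['F00000ZG2F']
```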

import requests
import re
import pandas as pd

auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'

# Create a Pandas Excel writer using openpyxl as the engine.
writer = pd.ExcelWriter('C:/Users/karlemilthulstrup/Downloads/data.xlsx', engine='openpyxl')

ids = ['F00000ZG2F', 'F0000025UI', 'F00000Z1MD', 'F00000Z4AE','F00000ZG2F']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
api_params = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}

for api_id in ids:
    # separate params for the auth page, so api_params is not overwritten
    auth_params = {
        'Site': 'dk',
        'FC': api_id,
        'IT': 'FO',
        'LANG': 'da-DK'}

    response = requests.get(auth, params=auth_params)

    # the page embeds a maas token; pull it out and use it as the bearer token
    search = re.search(r'(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)

    headers.update({'Authorization': bearer})

    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=api_params).json()
    
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        
        historicRange = v.pop('historicRange')
        row.update(v)
        
        for each in historicRange:
            row.update(each)
            
            rows.append(row.copy())
        
    
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' %sheetName)

writer.close()

3 Comments

Thank you so much, it works great! The sheets get quite random names, but this shouldn't be a problem; I just have to create a lookup to link each ID with a more recognisable name. Nevertheless, thank you so much for your help!
No problem. Yes, if you create a lookup that gives you a better description, that should work fine. I just don't know what would make sense to you. You could also debug and see if there is something unique in the JSON that would make more sense.
I created a list now that works great in Excel - thank you anyway though! However, I was wondering if you could help me extract the data from another table? I am guessing the code would be approximately the same, but I can't see how to make it work. It is the first table of this link: morningstar.dk/dk/funds/snapshot/…
