
My Code:

from bs4 import BeautifulSoup as soup
import requests
import pandas as pd

Scraper2Excel = "C:\\Users\\Ashley\\FromPython3.xlsx"

writer = pd.ExcelWriter(Scraper2Excel, engine='xlsxwriter')

READ = "C:\\Users\\Ashley\\URLs List.xlsx"
Tickers1 = pd.read_excel(READ, sheet_name='Tickers', header=None)
Tickers = Tickers1.values.ravel()
print(Tickers)

UniformResourceLocators = pd.read_excel(READ, sheet_name= 'URLs', header=None, skiprows=1)
UniformResourceLocatorsTitles = pd.read_excel(READ, sheet_name='URLs', header=None, nrows=1).values[0]
UniformResourceLocators.columns = UniformResourceLocatorsTitles

URLs = UniformResourceLocators['Company News URL']
tick = UniformResourceLocators['Tickers']

startrow =0

for i in Tickers:
    s = Tickers1.loc[(Tickers1[0]==i)]
    print(s)
    s.to_excel(writer, sheet_name='Sheet1', startrow= startrow, startcol= 0, header=False, index=False)
    startrow += 1

    url = URLs.loc[(tick==i)]
    print(url)
    
    for i in url:
        html_text = requests.get(i).text
        chickennoodle = soup(html_text, 'html.parser')
    
        for link in chickennoodle.find_all('a'):
            my_links = link.get('href')
            print(my_links)

I get stuck here. my_links prints a series of URLs as strings, and I want to write them to an Excel file. I haven't found a way to convert them to a DataFrame so that pandas will let me use to_excel. I'm a novice, so thanks for any help.

             
            #df = my_links??
            df.to_excel(writer, sheet_name='Sheet2', startrow= startrow, startcol=0, header=False, index=False)
            startrow += 1


writer.close()  # writer.save() was removed in pandas 2.0

1 Answer


I would create an empty list before your for loop and append my_links to it inside the loop.

Then, once the loop has finished, you can turn that list into a column of your DataFrame before exporting to Excel. Something like:

mylinksList = []
df = pd.DataFrame()

for i in url:
    html_text = requests.get(i).text
    chickennoodle = soup(html_text, 'html.parser')

    for link in chickennoodle.find_all('a'):
        my_links = link.get('href')
        mylinksList.append(my_links)

df['links'] = pd.Series(mylinksList)
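
The same list-then-DataFrame idea can be sketched as a small, self-contained example. The helper names extract_links and links_to_frame, the sample HTML, and the "ABC" ticker are all made up for illustration; the HTML string simply stands in for requests.get(url).text so the snippet runs without network access.

```python
from bs4 import BeautifulSoup
import pandas as pd


def extract_links(html_text):
    """Return every href found in the <a> tags of an HTML string."""
    parsed = BeautifulSoup(html_text, "html.parser")
    # href=True skips anchors without an href attribute,
    # which link.get('href') would otherwise return as None.
    return [a["href"] for a in parsed.find_all("a", href=True)]


def links_to_frame(ticker, links):
    """Build a DataFrame with one row per link, tagged with its ticker."""
    return pd.DataFrame({"ticker": ticker, "link": links})


# Hypothetical sample page standing in for requests.get(url).text
sample_html = (
    '<a href="/news/1">One</a>'
    '<a name="top">No href</a>'
    '<a href="/news/2">Two</a>'
)

frame = links_to_frame("ABC", extract_links(sample_html))
print(frame)
```

In the question's loop, one such frame could be collected per URL, the frames combined with pd.concat(frames, ignore_index=True), and the result written once after the loop with to_excel(writer, sheet_name='Sheet2', index=False). That avoids tracking startrow by hand for every link.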

2 Comments

A nice addition would be to write each link as an Excel hyperlink formula, built as a string like this: f'=HYPERLINK("websitename.com{link_adress}", "{link_text}")'
Thanks, this helped tons! I ended up doing mylinksList.append(my_links), then a = pd.DataFrame(mylinksList) and a.to_excel...
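
The HYPERLINK suggestion in the first comment could look roughly like the sketch below. The function name hyperlink_formula, the base URL, and the sample rows are assumptions for illustration only. It also assumes the Excel engine writes strings that start with "=" as formulas, which is the default behaviour I would expect from xlsxwriter, but it is worth checking in the output file.

```python
def hyperlink_formula(base_url, href, text):
    """Build an Excel HYPERLINK formula string for one scraped link."""
    # Excel escapes a double quote inside a string literal by doubling it.
    safe_text = text.replace('"', '""')
    return f'=HYPERLINK("{base_url}{href}", "{safe_text}")'


# Hypothetical (href, link text) pairs, as might come from
# link.get('href') and link.get_text() in the scraping loop.
scraped = [("/news/1", "One"), ("/news/2", 'The "big" story')]

formulas = [
    hyperlink_formula("https://example.com", href, text)
    for href, text in scraped
]

for formula in formulas:
    print(formula)
```

The resulting list can be appended to mylinksList in place of the raw href values, so the exported column shows clickable link text instead of bare URLs.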
