
I have been working on web scraping in Python and I want to create a DataFrame from a file linked on a web page. The file is in .ods format. So far I have downloaded the .ods file with BeautifulSoup and requests and then read it locally to build the DataFrame. The file itself contains header rows that have to be removed. I achieved a successful result with this method, and my code is attached below.

from pandas_ods_reader import read_ods
import bs4
import requests
import pandas as pd

url = "https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")

# find the .ods attachment link and save the file locally
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.endswith('.ods'):
        file_data = requests.get(href).content
        with open('data.ods', "wb") as file:
            file.write(file_data)

# drop the header rows at the top and the notes at the bottom,
# then promote the first remaining row to column labels
df = read_ods('data.ods', 1, headers=False)[6:-44]
df.index = range(0, 346)
df.columns = df.iloc[0]
df = df.drop(0)
df

Now I want to figure out whether this can be done without downloading the .ods file first. If there is a way to create a DataFrame directly from the .ods file linked on the web page, that would serve my purpose. Please suggest suitable code if this is achievable.

  • Shorten beginning section to just soup.select_one('.thumbnail')['href'] Commented Feb 14, 2021 at 20:02
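The comment's suggestion replaces the whole link-scanning loop with a single CSS selector. A minimal sketch of the idea, assuming the page marks the attachment link with a `thumbnail` class (a detail of gov.uk's markup that may change; the HTML below is a stand-in):

```python
import bs4

# stand-in for the fetched page; on gov.uk the attachment link
# is assumed to carry the "thumbnail" class
html = '<a class="thumbnail" href="https://example.com/data.ods">file</a>'
soup = bs4.BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector
href = soup.select_one('.thumbnail')['href']
print(href)  # → https://example.com/data.ods
```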

2 Answers


Pandas' read_excel allows you to load DataFrames directly from URLs. It can also read ODF (.ods) files once you install odfpy (pip install odfpy):

import bs4
import requests
import pandas as pd

url = "https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']
    if href.endswith('.ods'):
        print(href)
        # read the spreadsheet straight from the URL, then trim the
        # header rows at the top and the notes at the bottom
        df = pd.read_excel(href, header=None)[6:-44]
        df.index = range(0, 346)
        df.columns = df.iloc[0]
        df = df.drop(0)
df

3 Comments

Thanks for your answer. It raises XLRDError: Openoffice.org ODS file; not supported on the line df = pd.read_excel(href, header=None)[6:-44]
I installed the odfpy library and tried df = pd.read_excel(href, header=None, engine='odf')[6:-44], but this raised ValueError: Unknown engine: odf
This is probably related to this issue. Try downgrading xlrd.

I'm not sure why you pick up all anchor tags (a). In any case, I'll assume you will apply the same logic over multiple pages, where each page can contain one or more files.

So the right approach is to append the URLs to a set, which cannot contain duplicates.

Then you can load each URL directly with pandas' read_excel.

I don't want to parse the table by skipping a fixed number of lines from the start and end; that's open for you to write a parsing function, for example one that keeps a row only if its first column contains a date.

import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    links = set(x['href'] for x in soup.select('a[href$=ods]'))
    for link in links:
        df = pd.read_excel(link)
        print(df)
        #df.to_csv('data.csv',index= False)


if __name__ == "__main__":
    main("https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic")
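The row-filtering function suggested above (keep a row only if its first column holds a date) could be sketched like this, replacing the hard-coded [6:-44] slicing; the assumption that dates live in the first column comes from this particular file's layout:

```python
import pandas as pd

def keep_date_rows(df):
    """Keep only rows whose first column parses as a date.

    Unparseable cells (titles, footnotes) become NaT and are dropped,
    so header and notes rows fall away without counting lines.
    """
    first = pd.to_datetime(df.iloc[:, 0], errors="coerce", dayfirst=True)
    return df[first.notna()].reset_index(drop=True)

# usage with a frame read without headers (hypothetical link):
# df = keep_date_rows(pd.read_excel(link, header=None, engine="odf"))
```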

3 Comments

Many thanks for giving this great example, and for sharing your ideas and knowledge with us. Have a great day!
Thanks for your answer. It raises XLRDError: Openoffice.org ODS file; not supported on the line df = pd.read_excel(link)
Many thanks dear αԋɱҽԃ αмєяιcαη - this looks awesome. Thanks for sharing your ideas with us. Greetings from Spain (Andalusia), where we still look forward to hearing from you ;)
