
I have been working on web scraping in Python and I want to create a DataFrame from a file linked on a web page. The file is in .ods format. So far I have downloaded the .ods file with BeautifulSoup and requests and then read it locally to build the DataFrame. The file itself contains header rows that have to be removed. I achieved a successful result with this method, and my code is attached below.

from pandas_ods_reader import read_ods
import bs4
import requests
import pandas as pd

url = "https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")

# find the .ods attachment link and save the file locally
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.endswith('.ods'):
        file_data = requests.get(href).content
        with open('data.ods', "wb") as file:
            file.write(file_data)

# drop the header rows at the top and the notes at the bottom,
# then promote the first remaining row to column labels
df = read_ods('data.ods', 1, headers=False)[6:-44]
df.index = range(0, 346)
df.columns = df.iloc[0]
df = df.drop(0)
df

Now I want to figure out whether this can be done without downloading the .ods file first. If there is a way to create a DataFrame directly from the .ods file linked on the web page, that would serve my purpose. Please suggest suitable code if this is achievable.

  • Shorten beginning section to just soup.select_one('.thumbnail')['href'] Commented Feb 14, 2021 at 20:02
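The comment's suggestion replaces the whole link-scanning loop with a single CSS selector. A minimal sketch of the idea, assuming the page marks the attachment link with a `thumbnail` class (a detail of gov.uk's markup that may change; the HTML below is a stand-in):

```python
import bs4

# stand-in for the fetched page; on gov.uk the attachment link
# is assumed to carry the "thumbnail" class
html = '<a class="thumbnail" href="https://example.com/data.ods">file</a>'
soup = bs4.BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector
href = soup.select_one('.thumbnail')['href']
print(href)  # → https://example.com/data.ods
```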

2 Answers


Pandas' read_excel allows you to load DataFrames directly from URLs. It can also read ODF (.ods) files once you install odfpy (pip install odfpy):

import bs4
import requests
import pandas as pd

url = "https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']
    if href.endswith('.ods'):
        print(href)
        # read the spreadsheet straight from the URL, then trim the
        # header rows at the top and the notes at the bottom
        df = pd.read_excel(href, header=None)[6:-44]
        df.index = range(0, 346)
        df.columns = df.iloc[0]
        df = df.drop(0)
df

3 Comments

Thanks for your answer. It raises XLRDError: Openoffice.org ODS file; not supported on the line df = pd.read_excel(href, header=None)[6:-44]
I installed the odfpy library and tried df = pd.read_excel(href, header=None, engine='odf')[6:-44], but this raised ValueError: Unknown engine: odf
This is probably related to this issue. Try downgrading xlrd.

I'm not sure why you pick up all anchor tags (a). In any case, I'll assume you will apply the same logic over multiple pages, where each page can contain one or more files.

So the right approach is to append the URLs to a set, which cannot contain duplicates.

Then you can load each URL directly with pandas' read_excel.

I don't want to parse the table by skipping a fixed number of lines from the start and end; that's open for you to write a parsing function, for example one that keeps a row only if its first column contains a date.

import requests
from bs4 import BeautifulSoup
import pandas as pd


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    links = set(x['href'] for x in soup.select('a[href$=ods]'))
    for link in links:
        df = pd.read_excel(link)
        print(df)
        #df.to_csv('data.csv',index= False)


if __name__ == "__main__":
    main("https://www.gov.uk/government/statistics/transport-use-during-the-coronavirus-covid-19-pandemic")
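The row-filtering function suggested above (keep a row only if its first column holds a date) could be sketched like this, replacing the hard-coded [6:-44] slicing; the assumption that dates live in the first column comes from this particular file's layout:

```python
import pandas as pd

def keep_date_rows(df):
    """Keep only rows whose first column parses as a date.

    Unparseable cells (titles, footnotes) become NaT and are dropped,
    so header and notes rows fall away without counting lines.
    """
    first = pd.to_datetime(df.iloc[:, 0], errors="coerce", dayfirst=True)
    return df[first.notna()].reset_index(drop=True)

# usage with a frame read without headers (hypothetical link):
# df = keep_date_rows(pd.read_excel(link, header=None, engine="odf"))
```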

3 Comments

Many thanks for giving this great example, and for sharing your ideas and knowledge with us. Have a great day!
Thanks for your answer. It raises XLRDError: Openoffice.org ODS file; not supported on the line df = pd.read_excel(link)
Many thanks dear αԋɱҽԃ αмєяιcαη - this looks awesome. Thanks for sharing your ideas with us. Greetings from Spain (Andalusia), where we still look forward to hearing from you ;)
