Python webscraping: How to parse html table, selenium

Question

I would like to parse the table "Table 1: Consumer Price Index, historical indices from 1924 (2015=100)" from here: https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen

This table:

I am using Selenium to open the table that I want to parse (see the code below), but the pd.read_html line raises this error:

ImportError: html5lib not found, please install it

even though html5lib is installed (pip list shows version 1.1). How can I best parse the table?

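import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()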

url = "https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen"
driver_no = webdriver.Chrome(options=options, executable_path=mypath)

driver_no.get(url)
sleep(2)
WebDriverWait(driver_no, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="attachment-table-figure-1"]/button')))
elem = driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button')
sleep(2)
driver_no.execute_script("arguments[0].scrollIntoView(true);", elem)
sleep(2)
driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button').click()

df_list = pd.read_html(driver_no.page_source, "html_parser")
driver_no.quit()

The page offers a dropdown to download a CSV or Excel version of the table. I'm not sure why you don't just use that.
– C. Peck, Commented Jul 16, 2022 at 18:11

Barry the Platipus · Accepted Answer · 2022-07-16 18:56:40Z

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

browser.get("https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen")
# The "html5lib" parser requires the html5lib package to be installed.
soup = BeautifulSoup(browser.page_source, "html5lib")
# Take the second <table> element on the page.
table = soup.select("table")[1]
browser.quit()

# Collect the text of every header and data cell, one list per row.
final_list = []
for row in table.select("tr"):
    final_list.append([x.text for x in row.find_all(["td", "th"])])

# The first row holds the column names; the remaining rows are data.
final_df = pd.DataFrame(final_list[1:], columns=final_list[0])
final_df[:-2]  # drop the last two rows

This returns the actual table:

           Y-avg2    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
0   2022        .  117.8  119.1  119.8  121.2  121.5  122.6      .      .      .      .      .      .
1   2021    116.1  114.1  114.9  114.6  115.0  114.9  115.3  116.3  116.3  117.5  117.2  118.1  118.9
2   2020    112.2  111.3  111.2  111.2  111.7  111.9  112.1  112.9  112.5  112.9  113.2  112.4  112.9
3   2019    110.8  109.3  110.2  110.4  110.8  110.5  110.6  111.4  110.6  111.1  111.3  111.6  111.3
4   2018    108.4  106.0  107.0  107.3  107.7  107.8  108.5  109.3  108.9  109.5  109.3  109.8  109.8
... ...       ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
89  1933      2.7    2.7    2.7    2.7    2.7    2.7    2.7    2.7    2.8    2.7    2.7    2.7    2.7
90  1932      2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8
91  1931      2.8    2.9    2.9    2.9    2.9    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8
92  1930      3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    2.9    2.9    2.9
93  1929      3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1
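
If you want to try the row-by-row extraction without launching a browser, here is a minimal sketch on a tiny inline snippet. Note that it swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party packages. The class name TableRows and the SAMPLE_HTML markup are made up for illustration (including the guess that the year sits in a th cell); only the numbers are copied from the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup: a cut-down stand-in for the real page.
SAMPLE_HTML = """
<table>
  <tr><th></th><th>Y-avg</th><th>Jan</th></tr>
  <tr><th>2021</th><td>116.1</td><td>114.1</td></tr>
  <tr><th>2020</th><td>112.2</td><td>111.3</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
parser.close()

# First row is the header, the rest are data rows.
header, *body = parser.rows
print(header)
print(body)
```

The header/body split mirrors how the answer's code separates the first row of final_list from the remaining rows before building the DataFrame.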

Regarding your 'html5lib' issue: without seeing your actual installation or virtual environment, there is not much help I can offer. Try reinstalling it, or installing it in a fresh virtual environment.
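
One quick check you can run yourself: a common cause of "installed but not found" is that pip installed the package for a different interpreter than the one running the script. The sketch below is only a diagnostic, not a fix. The helper name describe_module is made up for illustration, and it uses only the standard library.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether `name` is importable from the running interpreter."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        # origin is the file the module would be loaded from, if any
        "origin": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # Parsers that pandas and BeautifulSoup can rely on for HTML tables.
    for module_name in ("html5lib", "bs4", "lxml"):
        print(describe_module(module_name))
```

Run it with the same interpreter that runs your scraper. If "found" is False for html5lib, installing with that interpreter's own pip (python -m pip install html5lib) is worth a try. Also, as far as I can tell from the pandas signature, the second positional parameter of pd.read_html is match, not the parser, so "html_parser" in your call is treated as a text pattern to match; the parser is selected with flavor= ("lxml", or "bs4"/"html5lib").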

Unfortunately, this is not the table I am looking to parse. Please refer to the picture in my question for the exact table: it is the one containing the historical data back to 1924.
