I would like to parse the table "Table 1: Consumer Price Index, historical indices from 1924 (2015=100)" from here: https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen

This table: See picture

I am using Selenium to open the table that I want to parse (see the code below), but the line with pd.read_html raises this error:

ImportError: html5lib not found, please install it

even though html5lib is installed (pip list shows version 1.1). What is the best way to parse the table?

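from time import sleep

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()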

url = "https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen"
driver_no = webdriver.Chrome(options=options, executable_path=mypath)

driver_no.get(url)
sleep(2)
WebDriverWait(driver_no, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="attachment-table-figure-1"]/button')))
elem = driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button')
sleep(2)
driver_no.execute_script("arguments[0].scrollIntoView(true);", elem)
sleep(2)
driver_no.find_element(By.XPATH, '//*[@id="attachment-table-figure-1"]/button').click()

df_list = pd.read_html(driver_no.page_source, "html_parser")
driver_no.quit()
  • The page offers a dropdown to download a CSV or Excel version of the table. I'm not sure why you don't just use this? Commented Jul 16, 2022 at 18:11

1 Answer

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

# Path to where you saved the chromedriver binary
webdriver_service = Service("chromedriver/chromedriver")
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

browser.get("https://www.ssb.no/en/priser-og-prisindekser/konsumpriser/statistikk/konsumprisindeksen")
soup = BeautifulSoup(browser.page_source, "html5lib")
table = soup.select("table")[1]  # second <table> element on the page
browser.quit()

# Collect the text of every header and data cell, row by row
final_list = []
for row in table.select("tr"):
    final_list.append([x.text for x in row.find_all(["td", "th"])])

# The first row holds the column names; the remaining rows are data
final_df = pd.DataFrame(final_list[1:], columns=final_list[0])
final_df[:-2]  # drop the last two rows

This returns the actual table:

          Y-avg2    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
0    2022      .  117.8  119.1  119.8  121.2  121.5  122.6      .      .      .      .      .      .
1    2021  116.1  114.1  114.9  114.6  115.0  114.9  115.3  116.3  116.3  117.5  117.2  118.1  118.9
2    2020  112.2  111.3  111.2  111.2  111.7  111.9  112.1  112.9  112.5  112.9  113.2  112.4  112.9
3    2019  110.8  109.3  110.2  110.4  110.8  110.5  110.6  111.4  110.6  111.1  111.3  111.6  111.3
4    2018  108.4  106.0  107.0  107.3  107.7  107.8  108.5  109.3  108.9  109.5  109.3  109.8  109.8
...   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
89   1933    2.7    2.7    2.7    2.7    2.7    2.7    2.7    2.7    2.8    2.7    2.7    2.7    2.7
90   1932    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8
91   1931    2.8    2.9    2.9    2.9    2.9    2.8    2.8    2.8    2.8    2.8    2.8    2.8    2.8
92   1930    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0    2.9    2.9    2.9
93   1929    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1    3.1
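
Every cell scraped this way is a string, and months without a value show up as ".". If you want numbers, something like the sketch below may help. The two-row `sample` frame and the `to_numeric_table` helper are made up for illustration (they are not part of the scraped data), and the column names `Year` and `Y-avg` are assumptions.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the scraped output (all cells are strings).
sample = pd.DataFrame(
    [["2022", ".", "117.8", "119.1"], ["2021", "116.1", "114.1", "114.9"]],
    columns=["Year", "Y-avg", "Jan", "Feb"],
)


def to_numeric_table(df):
    """Convert every column to numbers; cells that are not numeric (e.g. ".") become NaN."""
    return df.apply(pd.to_numeric, errors="coerce")


clean = to_numeric_table(sample)
print(clean.dtypes)
```

Note that `errors="coerce"` silently turns anything non-numeric into NaN, so it is worth checking the result if the table also contains footnote rows.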

As for your html5lib error: without seeing your actual installation or virtual environment, there is not much help I can offer. Try reinstalling it, or install it in a fresh virtual environment.
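
One thing that can be checked without knowing the setup is whether the interpreter that runs the script can actually see the package, since pip may have installed it into a different environment. Below is a minimal sketch using only the standard library; the `find_missing_parsers` helper is a hypothetical name, and the default module list is an assumption about which parser packages matter here. Separately, note that the second positional parameter of pd.read_html is `match` (a text pattern), not the parser, so `"html_parser"` in the question does not select a parser; the parser is chosen with the `flavor` keyword.

```python
import importlib.util
import sys


def find_missing_parsers(modules=("html5lib", "bs4", "lxml")):
    """Return the names in `modules` that this interpreter cannot import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Compare this path with the environment that `pip list` reports on.
    print("Interpreter:", sys.executable)
    print("Missing parser packages:", find_missing_parsers())
```

If html5lib is listed as missing here while `pip list` shows it, the script and pip are most likely using different Python environments; running `python -m pip install html5lib` with the same interpreter is one way to line them up.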


1 Comment

Unfortunately, this is not the table I am looking to parse. Please refer to the picture I posted in my question for the exact table. It is the one containing the historical data back to 1924
