Difficulty extracting HTML table with Python and Pandas

Question

I am trying to extract data from the HTML table on the following website: https://fuelkaki.sg/home

My Python code is shown below. Pandas is unable to detect the table. I suspect this is because Beautiful Soup is not capturing the page's HTML properly.

import sys
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd

try:
    url = 'https://fuelkaki.sg/home'
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
    page=requests.get(url, headers=headers)
except Exception as e:
    error_type, error_obj, error_info = sys.exc_info()
    print ('ERROR FOR LINK:', url)
    print (error_type, 'Line:', error_info.tb_lineno)
    
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')

df = pd.read_html(page.text)
df

I have also tried using Selenium (see code below), but I am still unable to capture the HTML table information.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://fuelkaki.sg/home'
options = Options()
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # Chrome binary location (raw string so the backslashes are kept literally)
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')


df = pd.read_html(page)
df

Any advice would be much appreciated.

It is not a static page whose data you can fetch using requests.
– keramat, Commented Mar 8, 2022 at 6:56
I have tried using Selenium (see above), but still to no avail.
– David, Commented Mar 8, 2022 at 11:34
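
One way to check the point made in the first comment is to count the table elements in the HTML that requests actually receives. The sketch below is only an illustration: `count_tables` is a hypothetical helper, and both HTML strings are made-up stand-ins, not the real markup of the site.

```python
from bs4 import BeautifulSoup


def count_tables(html: str) -> int:
    """Return the number of <table> elements present in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("table"))


# Hypothetical stand-in for what a JavaScript-driven site might return to
# requests: an application shell with no table markup in it.
static_shell = (
    "<html><body><app-root></app-root>"
    "<script src='main.js'></script></body></html>"
)

# Hypothetical stand-in for the same page after a browser has rendered it.
rendered = """
<html><body><table class="table">
<tr><td>A</td><td>B</td></tr>
</table></body></html>
"""

print(count_tables(static_shell))  # 0
print(count_tables(rendered))      # 1
```

If `count_tables(page.text)` returns 0 for the response from `requests.get`, then `pd.read_html` has nothing to parse and raises a `ValueError` ("No tables found"), which would suggest the table is built in the browser after the page loads.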

keramat · Accepted Answer · 2022-03-08 13:32:07Z

Use:

import time
import numpy as np  # required for np.array below
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://fuelkaki.sg/home'
options = Options()
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)  # wait for the page to finish rendering
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
table = soup.find("table", {"class": "table"})
# Strip the U+202C formatting character from every cell,
# then reshape the flat list of cells into rows of 5 columns
cells = [x.text.replace('\u202c', '') for x in table.find_all('td')]
pd.DataFrame(np.array(cells).reshape(-1, 5))

Please be aware that using website data can be unethical.
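
The `reshape(-1, 5)` call in the code above assumes the table always has exactly five columns and raises an error if the number of cells is not a multiple of five. As a hedged alternative, the sketch below builds one row per `<tr>` element instead. The helper name `table_to_dataframe` and the sample markup are hypothetical and only illustrate the idea; the real page's columns and values may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html: str, css_class: str = "table") -> pd.DataFrame:
    """Build a DataFrame from the first <table> with the given class, one row per <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": css_class})
    if table is None:
        raise ValueError(f"no <table class='{css_class}'> found in the HTML")
    rows = []
    for tr in table.find_all("tr"):
        # Remove the U+202C formatting character and surrounding whitespace.
        cells = [td.get_text().replace("\u202c", "").strip() for td in tr.find_all("td")]
        if cells:  # skip rows that contain only <th> header cells
            rows.append(cells)
    return pd.DataFrame(rows)


# Hypothetical sample markup, not taken from the real site.
sample = """
<table class="table">
  <tr><th>Grade</th><th>Brand A</th><th>Brand B</th></tr>
  <tr><td>92</td><td>1.00</td><td>2.00</td></tr>
  <tr><td>95</td><td>3.00</td><td>4.00</td></tr>
</table>
"""

df = table_to_dataframe(sample)
print(df.shape)  # (2, 3)
```

With the Selenium code above, this would be called as `table_to_dataframe(page)` after `driver.quit()`. The column count then follows whatever the page contains, and numpy is no longer needed.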

1 Comment

David Over a year ago

I added "import numpy as np" to the code and it works now. Thanks
