I am trying to scrape a table that is generated by JavaScript from a website into a pandas DataFrame. The soup contains only a reference to the script, not the table itself. The MWE and the soup output are given below. Is it possible to get the table from https://iborrowdesk.com into a DataFrame, and if so, how?

MWE

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) \
                Chrome/72.0.3626.28 Safari/537.36'}
session = requests.Session()
website = session.get('https://iborrowdesk.com', headers=headers, timeout=10)
website.raise_for_status()
soup = BeautifulSoup(website.text, 'lxml')
table = soup.find('table', class_='table table-condensed table-hover')
data = pd.read_html(str(table))[0]

Soup output

<html><head><link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/site.webmanifest" rel="manifest"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="#ffffff" name="theme-color"/>
<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.6/flatly/bootstrap.min.css" rel="stylesheet"/>
<meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>IBorrowDesk</title><script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div class="container"></div><script src="/static/main.bundle.js?39ed89dd02e44899ebb4">
</script></body></html>
  • What is a "JavaScript table"? Do you mean "table generated by JavaScript"? Commented Oct 6, 2022 at 4:09
  • You'll need to execute the JavaScript that creates the table somehow. Web scraping is hard. Commented Oct 6, 2022 at 4:15
  • @tadman The latter "table generated by JavaScript". Commented Oct 6, 2022 at 4:44
  • That probably means you can just grab the data, no need to "scrape". Look at the network requests more closely. You may have it in JSON or XML already. Commented Oct 6, 2022 at 4:54

2 Answers

You can use requests on its own, because the site exposes an API.

import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    # API endpoint exposed by the site.
    url = "https://iborrowdesk.com/api/most_expensive"

    with requests.Session() as session:
        response = session.get(url, timeout=10)
    # Raise requests.HTTPError on a 4xx/5xx response instead of continuing.
    response.raise_for_status()

    # Decode the JSON body; the rows are stored under the "results" key.
    data = response.json()

    return pd.json_normalize(data=data["results"])


df = get_data()
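
To see what the json_normalize step does without calling the site, here is a minimal offline sketch. Only the top-level "results" key is taken from the code above; the field names and values in sample_payload, and the helper payload_to_frame, are hypothetical and may not match what the real API returns.

```python
import pandas as pd

# Hypothetical payload: only the "results" key comes from the answer above;
# the field names and values are invented for illustration.
sample_payload = {
    "results": [
        {"symbol": "AAA", "fee": 12.5, "available": 1000},
        {"symbol": "BBB", "fee": 7.25, "available": 250},
    ]
}


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Flatten the list stored under "results" into one row per record."""
    return pd.json_normalize(payload["results"])


df_sample = payload_to_frame(sample_payload)
print(df_sample)
```

Each dictionary in the list becomes one row, and each key becomes a column. If the real records contain nested dictionaries, json_normalize flattens them into dotted column names.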

As Jason Baker mentioned in his answer, you can use the API that the site provides. Alternatively, you can use Selenium to scrape the data. The question "Python webscraping: BeautifulSoup not showing all html source content" is relevant here: it explains why requests.Session().get(url) cannot retrieve all of the elements in the DOM. The elements are created by JavaScript, so the HTML that the server returns does not contain them; they are inserted only once the script runs in a browser. The answers to that question also include a code snippet, which I've updated to match your case:

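from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Firefox()
browser.get('https://iborrowdesk.com/')
# The table is inserted by JavaScript, so wait up to 10 s for it to appear.
table = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
).get_attribute("outerHTML")
browser.quit()
# StringIO avoids the pandas deprecation of passing literal HTML strings.
data = pd.read_html(StringIO(table))[0]
print(data)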
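
The reason a plain requests call finds no table can be checked offline. The sketch below parses a trimmed copy of the soup output from the question (shortened here for brevity) and shows that the static HTML contains a script reference but no table element. It uses the standard-library html.parser backend rather than lxml, which is an assumption made only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML that requests receives (from the soup output in the question).
raw_html = """
<html><head><title>IBorrowDesk</title></head>
<body><div class="container"></div>
<script src="/static/main.bundle.js?39ed89dd02e44899ebb4"></script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
table = soup.find("table")
scripts = [tag.get("src") for tag in soup.find_all("script")]

print(table)    # None: the static HTML has no <table>
print(scripts)  # the bundle that builds the table once it runs in a browser
```

Because soup.find returns None here, passing str(table) to pd.read_html cannot succeed; the table only exists after a browser, such as the one Selenium drives, executes the script.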
