Scrape Dynamic contents created by Javascript using Python

Question

I want to scrape DIV content that is created by a JavaScript function, using a Python script. I tried BeautifulSoup (BS4), but it does not return the dynamic data; it only shows the page's source code.

Sample code:

import requests
from bs4 import BeautifulSoup

URL = "https://rawgit.com/skysoft999/tableauJS/master/example.html"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')


for row in soup.findAll('div', attrs = {'class':'quote'}):
    print(row)


print(soup.prettify())

Sample HTML source code is in Pastebin

Sample data to be extracted:

BeautifulSoup can not parse the content created with JS, you need to use selenium maybe.
– BcK, Commented Apr 20, 2018 at 10:06
Does this answer your question? Web-scraping JavaScript page with Python
– ggorlen, Commented Dec 28, 2020 at 4:30

radzak · Accepted Answer · 2018-04-20 16:40:56Z

10

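The initial HTML does not contain the data you want to scrape, which is why BeautifulSoup alone is not enough: it parses the markup the server sends and never runs the page's JavaScript. You can load the page with Selenium, let the script run, and then scrape the rendered content.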
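To make the first point concrete, here is a minimal sketch that parses a made-up page without a browser. The `STATIC_HTML` string and the `DivTextCollector` class are hypothetical and only for illustration. The sketch uses the standard-library `html.parser` so it runs without extra packages; BeautifulSoup receives the same unexecuted markup from `requests`, so it finds the same empty div.

```python
from html.parser import HTMLParser

# Hypothetical page: the div stays empty until a browser runs the script.
STATIC_HTML = """
<html><body>
  <div id="dataTarget"></div>
  <script>
    document.getElementById("dataTarget").innerHTML = "<div>[1, 2, 3]</div>";
  </script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside the element with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = DivTextCollector("dataTarget")
collector.feed(STATIC_HTML)
text = "".join(collector.chunks).strip()

# The parser only sees the markup as sent; the script is never executed,
# so the target div has no text.
print(repr(text))
```

The depth counter is deliberately simple and does not handle void elements such as `<br>` inside the target; it is enough for this sample page.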

Code:

import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for button to be enabled
    # until() returns the element once it is clickable,
    # so no second lookup is needed
    button = WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    button.click()

    # wait for data to be loaded
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    html = browser.page_source
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)

    import pprint
    pprint.pprint(data)

Output:

[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
  {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
  {'formattedValue': 'ALEX', 'value': 'ALEX'},
  {'formattedValue': '16.70000', 'value': '16.7'},
  {'formattedValue': '-84.40000', 'value': '-84.4'},
  {'formattedValue': '30', 'value': '30'}],
  ...
]
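Each row in this output is a list of cells, and each cell is a dict with a `formattedValue` and a `value` key. If you only need the raw values, one way to flatten the structure is sketched below. The helper name `rows_to_values` is hypothetical, and the sample holds only the single row shown above; the column meanings are not given in the data, so no header row is written.

```python
import csv
import io

# One row copied from the output above.
data = [
    [
        {'formattedValue': 'Atlantic', 'value': 'Atlantic'},
        {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
        {'formattedValue': 'ALEX', 'value': 'ALEX'},
        {'formattedValue': '16.70000', 'value': '16.7'},
        {'formattedValue': '-84.40000', 'value': '-84.4'},
        {'formattedValue': '30', 'value': '30'},
    ],
]


def rows_to_values(rows):
    """Keep only the 'value' field of every cell, row by row."""
    return [[cell['value'] for cell in row] for row in rows]


values = rows_to_values(data)

# Write the rows as CSV into an in-memory buffer; swap the buffer for
# open('data.csv', 'w', newline='') to write a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(values)
csv_text = buffer.getvalue()
print(csv_text)
```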

The code assumes that the button is initially disabled (<button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>) and that the data is loaded only when the button is clicked, not automatically. Therefore, you need to delete this line from your page: setTimeout(function(){ getUnderlyingData(); }, 3000);.

You can find a working demo of your example here: http://demo-tableau.bitballoon.com/.


3 Comments

frank hk Over a year ago

One of the most helpful answers I have ever received. You saved my day. I appreciate your quick response and skill.

Hefaz Over a year ago

@Jatimir how can I scrape the same problem as mentioned in the question, but the element is generated by a PHP script. Here is my question. link

trustory Over a year ago

any way to use this to scrape this site? tradingview.com/symbols/INDEX-MMTW
