getting html dynamic content python3

Question

I want to get part of the dynamic HTML content of a website. I can see this content in "Inspect Element" but not in "View Source". I tried the BeautifulSoup and Selenium libraries without success, because after loading the page I need to press some on-screen buttons to load the content.

For example, on http://play.typeracer.com I can load the HTML source, but I can't load the content (tables and text) that shows up after pressing "Practice" on the page.

I hope this is clear. Thanks for your attention.

Check out the requests-html package. It allows you to render a page before extracting data.
– RandomDude, Commented Jul 29, 2018 at 13:09
When using Selenium's webdriver I was able to open Firefox with driver = webdriver.Firefox() and driver.get("website.com") and press the key. But if I try to load the content after any key press, it gives me an error and crashes the program. I will check it out.
– Miguel, Commented Jul 29, 2018 at 13:15
It is unclear to me what you are actually trying to achieve. Do you want to scrape the content, or do you want to automate/simulate a website user? Please give a full example; otherwise I can't help you.
– RandomDude, Commented Jul 29, 2018 at 13:55
I want to scrape content, for example saving the text that you have to type to a .txt file.
– Miguel, Commented Jul 29, 2018 at 14:05
By "text that you have to write" I mean the sentences that are part of the TypeRacer game, the ones that show up when you press "Practice", for example.
– Miguel, Commented Jul 29, 2018 at 14:06

RandomDude · Accepted Answer · 2018-07-29 15:40:49Z


Here is a solution using Selenium and Firefox:

Open a browser window and navigate to the URL.
Wait until the "Practice" link appears, then click it.
Extract all span elements that hold part of the text.
Build the output string. If the first word has only one letter, there are only 2 span elements; if it has more than one letter, there are 3.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


url = 'http://play.typeracer.com/'
browser = webdriver.Firefox()
browser.get(url)

# Wait until the "Practice" link is present, then click it.
# until() raises TimeoutException if that takes longer than 30 seconds.
practice_link = WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.LINK_TEXT, 'Practice')))
practice_link.click()

# Wait until the spans that hold the text are present, then read them.
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
# which was removed in Selenium 4.
xpath = '//span[@unselectable="on"]'
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.XPATH, xpath)))
spans = browser.find_elements(By.XPATH, xpath)
if len(spans) == 2:  # first word has only one letter
    text = f'{spans[0].text} {spans[1].text}'
elif len(spans) == 3:  # first word has more than one letter
    text = f'{spans[0].text}{spans[1].text} {spans[2].text}'
else:
    text = ' '.join(span.text for span in spans)
    print(f'special case that is not handled yet: {text}')


print(text)

Example output:

Scissors cuts paper. Paper covers rock. Rock crushes lizard. Lizard poisons Spock. Spock smashes scissors. Scissors decapitates lizard. Lizard eats paper. Paper disproves Spock. Spock vaporizes rock. And as it always has, rock crushes scissors.

Update

Just in case you also want to automate the typing afterwards ;)

# Wait until the input field is present, then type the text letter by letter.
txt_input = WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.XPATH,
        '//input[@class="txtInput" and @autocorrect="off"]')))
for letter in text:
    txt_input.send_keys(letter)

The WebDriverWait calls are there because the content is loaded dynamically, which can sometimes take quite a while. No try/finally is needed around them: until() blocks until the condition holds and raises TimeoutException if the 30 seconds run out.
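
The span-joining rule described above can be isolated into a small helper that works on plain strings, which also makes it easy to save the result to a .txt file as asked in the question. This is only a sketch: join_span_texts, save_text and the file name typeracer_text.txt are hypothetical names, and the list passed in stands for [span.text for span in spans].

```python
from pathlib import Path


def join_span_texts(parts):
    """Rebuild the sentence from the texts of the span elements.

    Assumes the layout described in the answer: with 3 spans, the first
    two hold the first word (first letter + rest); with 2 spans, the
    first word is a single letter. Anything else is joined with spaces.
    """
    if len(parts) == 3:
        return f'{parts[0]}{parts[1]} {parts[2]}'
    # 2 spans (one-letter first word) and the unhandled fallback
    # both reduce to a plain space-separated join.
    return ' '.join(parts)


def save_text(text, filename='typeracer_text.txt'):
    """Write the text to a UTF-8 file and return its path."""
    path = Path(filename)
    path.write_text(text + '\n', encoding='utf-8')
    return path


# Plain strings standing in for [span.text for span in spans].
example = join_span_texts(['S', 'cissors', 'cuts paper.'])
print(example)
```

With a live browser session, save_text(join_span_texts([span.text for span in spans])) would write the scraped sentence to disk.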

edited Jul 29, 2018 at 15:40

answered Jul 29, 2018 at 14:45

RandomDude


4 Comments

Miguel Over a year ago

Thanks, amazing answer! Can you explain why searching for an element requires "//" at the start, and what exactly WebDriverWait does? Shouldn't it wait automatically, so that try/finally is not required?

RandomDude Over a year ago

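// is part of the XPath syntax: it selects matching elements anywhere in the document, not just at the top level. WebDriverWait waits up to 30 s in our case until the element is found and raises TimeoutException otherwise. Note that a finally block runs whether or not the try block raised, so try/finally does not by itself guard against a failed wait.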
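
To illustrate what the leading // does, here is a small sketch using Python's standard-library xml.etree.ElementTree, which supports a limited XPath subset. The markup is made up for the example and is not TypeRacer's real page, and ElementTree needs the relative form .// where Selenium accepts //.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the spans the answer targets.
page = ET.fromstring(
    '<html><body>'
    '<div><span unselectable="on">S</span>'
    '<span unselectable="on">cissors</span></div>'
    '<table><tr><td><span unselectable="on">cuts paper.</span></td></tr></table>'
    '<span>not part of the text</span>'
    '</body></html>'
)

# './/span' matches span elements at any depth below the root,
# and the predicate keeps only those with unselectable="on".
matches = page.findall(".//span[@unselectable='on']")
texts = [span.text for span in matches]
print(texts)

# Without '//' the path only looks at direct children of the root,
# and the root <html> element has no direct <span> children here.
direct_children = page.findall('span')
print(len(direct_children))
```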

RandomDude Over a year ago

My code is not meant to be perfect - just a quick and dirty solution that should give you enough to understand how it works ;)

Miguel Over a year ago

Yes, it helped me understand the webdriver. Thanks!
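
On the WebDriverWait and try/finally question from the comments: the following is a simplified sketch of the polling idea behind an explicit wait, not Selenium's actual implementation. wait_until, WaitTimeout and element_appears are hypothetical names introduced for illustration. The second part shows that a finally block runs even when the try block raises.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition stays falsy until the timeout expires."""


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f'condition not met within {timeout} s')
        time.sleep(poll)


# A stand-in for "the element is present": truthy on the third check.
calls = {'count': 0}


def element_appears():
    calls['count'] += 1
    return 'element' if calls['count'] >= 3 else None


found = wait_until(element_appears, timeout=2.0, poll=0.01)
print(found, calls['count'])

# finally runs whether or not the try block raised.
log = []
try:
    try:
        wait_until(lambda: None, timeout=0.05, poll=0.01)
    finally:
        log.append('finally ran')
except WaitTimeout:
    log.append('timeout propagated')
print(log)
```

Because the wait already blocks and raises on timeout, code that depends on the element can simply follow the wait call on the next line.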
