Getting html source when some html is generated by javascript

Question

I am attempting to get the source code from a webpage including html that is generated by javascript. My code currently is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup

case_url = "http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo"
try:
    browser = webdriver.Firefox()
    browser.get(case_url)
    url = browser.page_source
    print url
    browser.close()
except:
    ...

soup = BeautifulSoup(url)
...extraction code that finds the right tags, but they are empty...

When I print the source stored in url, it prints the usual HTML, but the generated HTML is missing. How do I get the same HTML that I see when I press F12, but programmatically?

Robbie Wareham · Accepted Answer · 2014-04-25 07:30:43Z

3

Further to alecxe's answer, your underlying issue was that you were extracting the HTML before the JavaScript had generated it. Selenium returns control as soon as the browser reports the page as loaded; it does not wait for HTML that JavaScript generates after the load.

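By using "find_elements", you will automatically wait for the elements to appear, up to the implicit-wait timeout configured on your driver. Note that the implicit wait defaults to 0, so it has to be set (for example with "implicitly_wait") before any waiting happens.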

If you read "page_source" after the "find_elements" call, you will see the full HTML.
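
The wait-then-read sequence described above can be sketched as a small polling helper. This is a hand-rolled loop written for Python 3, not Selenium's own wait mechanism (WebDriverWait would be the usual choice in real code). The function name and the timeout and poll defaults are assumptions, and the driver is only assumed to expose Selenium 4's find_elements(by, value) and page_source.

```python
import time


def wait_for_rendered_source(driver, by, value, timeout=10.0, poll=0.5):
    """Poll until at least one matching element exists, then return page_source.

    Hypothetical helper: `driver` is assumed to behave like a Selenium 4
    WebDriver, i.e. find_elements(by, value) returns a (possibly empty) list
    and page_source holds the current DOM serialised as HTML.
    """
    deadline = time.monotonic() + timeout
    while True:
        # An empty list means the JavaScript has not produced the element yet.
        if driver.find_elements(by, value):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no element matched %s=%r within %s seconds" % (by, value, timeout)
            )
        time.sleep(poll)
```

With a real driver this might be called as html = wait_for_rendered_source(browser, "id", "game1"), where "id" is the string behind By.ID, and the returned html can then be handed to BeautifulSoup as in the question.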

I have automated many web pages whose content is generated dynamically on the client side, and have had no issues provided you wait for the HTML to be rendered.

alecxe is correct that there is no need to use BeautifulSoup, but I wanted to make it clear that Selenium is perfectly able to automate JavaScript-generated HTML.


alecxe · Accepted Answer · 2014-04-25 01:20:00Z

2

You don't really need BeautifulSoup to parse the HTML in this case; Selenium itself is pretty powerful when it comes to locating elements.

Here's how you can parse the contents of each tab/game one by one:

from selenium import webdriver

case_url = "http://na.leagueoflegends.com/tribunal/en/case/5555631/#nogo"
browser = webdriver.Firefox()
browser.get(case_url)

game_tabs = browser.find_elements_by_xpath('//a[contains(@id, "tab-")]')
for index, tab in enumerate(game_tabs, start=1):
    tab.click()
    game = browser.find_element_by_id('game%d' % index)
    game_type = game.find_element_by_id('stat-type-fill').text
    game_length = game.find_element_by_id('stat-length-fill').text
    game_outcome = game.find_element_by_id('stat-outcome-fill').text

    game_chat = game.find_element_by_class_name('chat-log')
    enemy_chat = [msg.text for msg in game_chat.find_elements_by_class_name('enemy') if msg.text]
    ally_chat = [msg.text for msg in game_chat.find_elements_by_class_name('ally') if msg.text]

    print game_type, game_length, game_outcome
    print "Enemy chat: ", enemy_chat
    print "Ally chat: ", ally_chat
    print "------"

prints:

Classic 34:48 Loss
Enemy chat:  [u'Akali [All] [00:01:38] lol', ... ]
Ally chat:  [u'Gangplank [All] [00:00:12] anyone remember the april fools lee sin spotlight? lol', ... ]
------
Dominion 19:22 Loss
Enemy chat:  [u'Evelynn [All] [00:00:10] Our GP has a Ti-83', ... ]
Ally chat:  [u'Miss Fortune [All] [00:00:18] arr ye wodden computer needs to walk the plank!', ... ]
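
The find_element_by_* helpers used above were removed in Selenium 4.3, so on a current install the same walk over the tabs might look like the sketch below. It is written for Python 3 and returns the data instead of printing it. The function name and dict keys are hypothetical, and the string locators "xpath", "id" and "class name" are the values behind By.XPATH, By.ID and By.CLASS_NAME.

```python
def collect_games(driver):
    """Click each game tab and return one dict per game.

    Hypothetical helper: `driver` is assumed to be a Selenium 4 WebDriver
    that has already loaded the case page from the question.
    """
    games = []
    tabs = driver.find_elements("xpath", '//a[contains(@id, "tab-")]')
    for index, tab in enumerate(tabs, start=1):
        tab.click()  # the game's details are filled in after the click
        game = driver.find_element("id", "game%d" % index)
        chat = game.find_element("class name", "chat-log")
        games.append({
            "type": game.find_element("id", "stat-type-fill").text,
            "length": game.find_element("id", "stat-length-fill").text,
            "outcome": game.find_element("id", "stat-outcome-fill").text,
            # Skip rows whose text is empty, as the original loop does.
            "enemy_chat": [m.text for m in chat.find_elements("class name", "enemy") if m.text],
            "ally_chat": [m.text for m in chat.find_elements("class name", "ally") if m.text],
        })
    return games
```

Returning a list of dicts keeps the scraping separate from whatever reporting or sorting is done afterwards.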

edited Apr 25, 2014 at 1:20

answered Apr 25, 2014 at 0:33


9 Comments

Kyle Grage Over a year ago

I had basically written the code before I read that I should use Selenium, so getting rid of BeautifulSoup would take a lot of refactoring. The Selenium methods do not seem to find the elements generated by JavaScript either.

alecxe Over a year ago

@KyleGrage ok, can you share the link or provide any info so it can be possible for others to debug/test the problem? Thanks.

alecxe Over a year ago

@KyleGrage thanks, what elements do you need to get?

Kyle Grage Over a year ago

here was my soup code

game1_html = soup.find("div", {"id": "game1"})
...
game5_html = soup.find("div", {"id": "game5"})

# gameX_html is game_html in the next bit of code
game_type = game_html.find("p", {"id": "stat-type-fill"}).string
game_length = game_html.find("p", {"id": "stat-length-fill"}).string
game_outcome = game_html.find("p", {"id": "stat-outcome-fill"}).string
game_chat = game_html.find("table", {"class": "chat-log"}).string

Kyle Grage Over a year ago

Yes, that helps a lot. However, I should have mentioned that later on I sort the chat into different categories, so I need the HTML filter classes inside the chat. For example, this is what I had:

# Ally chat processing
achat_arr = game_chat.findAll("tr", {"class": "ally alliesFilter"})
aallchat_arr = game_chat.findAll("tr", {"class": "ally alliesFilter enemiesFilter"})
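
The class-based sorting described in this comment can be sketched without BeautifulSoup. The version below swaps in Python 3's standard-library html.parser and groups the text of each <tr> by its exact class string, so it could run on the browser.page_source string once the page has rendered. The class and function names are hypothetical, and the <td> cell layout is an assumption about the page's markup.

```python
from html.parser import HTMLParser


class ChatRowCollector(HTMLParser):
    """Group the text of <tr> rows by their exact class attribute."""

    def __init__(self):
        super().__init__()
        self.rows = {}        # class string -> list of row texts
        self._current = None  # class string of the row being read, if any
        self._parts = []      # text fragments seen inside the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = dict(attrs).get("class") or ""
            self._parts = []

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            # Join the cell texts and collapse runs of whitespace.
            text = " ".join(" ".join(self._parts).split())
            if text:  # ignore rows with no visible text
                self.rows.setdefault(self._current, []).append(text)
            self._current = None


def group_chat_rows(html):
    """Return {class string: [row text, ...]} for every non-empty <tr>."""
    collector = ChatRowCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```

Looking up group_chat_rows(html).get("ally alliesFilter", []) would then play the role of achat_arr, with the same exact-string match on the class attribute.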
