Screen scraping dynamic webpage in python with Ghost.py

Question

from ghost import Ghost

ghost = Ghost()
page, rcs = ghost.open('https://soundcloud.com/passionpit/sets/favorites')
page, rcs = ghost.wait_for_page_loaded()
songs = ghost.evaluate("document.getElementsByClassName('soundTitle__title');")
print songs

I am attempting to use the above code to find all HTML elements on that page that have the class 'soundTitle__title'. However, my output right now is:

QFont::setPixelSize: Pixel size <= 0 (0)
({PyQt4.QtCore.QString(u'length'): 0.0}, [])

Can anyone help me see where my problem is? When I run document.getElementsByClassName('soundTitle__title') in my browser's console I get the output I expect. Why is the Python output different?

Or is there some way for me to use Ghost.py or another similar library to get the source of the page after the JavaScript has run (the source seen when inspecting an element with browser developer tools)?

I'm sorry, but I just have to do this with lxml.html: from lxml import html; html.parse("https://soundcloud.com/passionpit/sets/favorites").getroot().cssselect(".soundTitle__title")
– ToonAlfrink, Commented Jun 24, 2014 at 22:12
I tried running your code and ran into some issues. My output is: IOError: Error reading file 'https://soundcloud.com/passionpit/sets/favorites': failed to load external entity "https://soundcloud.com/passionpit/sets/favorites"
– walshie4, Commented Jun 24, 2014 at 22:20

bencassedy · Accepted Answer · 2014-06-29 15:03:34Z

I got this working with Splinter, which I would recommend. It is basically just running phantomjs and selenium under the hood.

You'll need to run pip install splinter and also install phantomjs on your machine, either by downloading and untarring it or, if you have npm, by running npm -g install phantomjs. Overall, the installation and dependencies are minimal and straightforward.

The following code returns 'Ryn Weaver - OctaHate', which I'm assuming is what you're looking for, although without more context I can't be totally sure.

from splinter import Browser

browser = Browser('phantomjs')
browser.visit('https://soundcloud.com/passionpit/sets/favorites')
songs = browser.find_by_xpath("//a[contains(@class, 'soundTitle__title')]")
if songs:
    for song in songs:
        print song.text
else:
    print "there aren't any songs"

You'll also notice that I had to use an XPath contains() to match the class you were looking for, so you might be running into a problem when trying to access that class with the notation you were using. There is a span element and also an anchor element that both contain 'soundTitle__title', but as far as I could tell only the 'a' element had text, and I would guess that's what you're looking for. If you want both, you could just do browser.find_by_xpath("//*[contains(@class, 'soundTitle__title')]")
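
To illustrate why a contains-style match can be needed, here is a minimal sketch that uses only the standard library rather than Splinter. The sample markup is hypothetical: the extra class name 'sc-link-dark' and the 'soundTitle__username' link are invented for the example and are not taken from the real page. It assumes the anchor carries more than one class, in which case an exact attribute comparison finds nothing. ElementTree's limited XPath does not support contains(), so the helper has_class (also hypothetical) stands in for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: class names other than 'soundTitle__title'
# are invented for illustration and not copied from the real page.
SAMPLE = """
<div>
  <a class="soundTitle__title sc-link-dark" href="#">Ryn Weaver - OctaHate</a>
  <span class="soundTitle__title"></span>
  <a class="soundTitle__username" href="#">someone</a>
</div>
"""


def has_class(element, name):
    """Return True if `name` is one of the element's space-separated classes."""
    return name in element.get("class", "").split()


root = ET.fromstring(SAMPLE)

# Exact attribute match: misses the anchor, because its class attribute
# holds two class names rather than exactly 'soundTitle__title'.
exact = root.findall(".//a[@class='soundTitle__title']")

# Token-based match: keeps every anchor whose class list includes the name.
by_token = [a for a in root.iter("a") if has_class(a, "soundTitle__title")]

titles = [a.text for a in by_token]
print(len(exact))
print(titles)
```

Note that XPath's contains(@class, 'soundTitle__title') is a plain substring test, so it would also match a class such as 'soundTitle__titleLink'. The token-based check in this sketch is slightly stricter; which of the two is appropriate depends on the page's actual markup.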
