
I'm looking for a way to write a script on Linux that scrapes the text from a page generated by JavaScript (specifically Etherpad, e.g. http://www.board.net). Ideally I'd like to use an existing tool, but I haven't found a suitable one: lynx doesn't support JavaScript, and Selenium runs in a browser. Suggestions welcome.

If there's nothing suitable (which would seem surprising for such a simple need), maybe I can write something myself in Python. What useful Python classes exist for something like this?

1 Answer

One option is to stick with Selenium but drive a headless browser such as PhantomJS, so that no browser window is opened.

Example (using the Firefox webdriver):

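from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://board.net/p/ThisIsBob%27sBoard/timeslider'
driver = webdriver.Firefox()  # opens a Firefox window
driver.get(url)

# Selenium 4 removed find_element_by_id; By.ID is the current spelling
element = driver.find_element(By.ID, 'padcontent')
print(element.text)
driver.quit()  # close the browser when done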

prints:

Here is some text I'd like to scrape
 I wonder how to go about it?
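
The example above opens a visible Firefox window and reads the element straight after the page loads. As a rough sketch only, assuming Selenium 4 or later with Firefox installed, the same idea can be run without a window and with a short wait for the JavaScript-rendered text to appear. The helper names `make_headless_firefox` and `scrape_text`, and the timeout values, are made up for this illustration; they are not part of Selenium.

```python
import time


def make_headless_firefox():
    """Start Firefox without a visible window (assumes Selenium 4+ and Firefox)."""
    # Imported here so the polling helper below can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox's command-line switch for headless mode
    return webdriver.Firefox(options=options)


def scrape_text(driver, url, element_id, timeout=10.0, poll=0.5):
    """Load url, then poll until the element with element_id has non-empty text."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the value of selenium's By.ID; find_elements returns an
        # empty list (rather than raising) while the element does not exist.
        matches = driver.find_elements("id", element_id)
        if matches:
            text = matches[0].text
            if text.strip():
                return text
        if time.monotonic() >= deadline:
            raise TimeoutError("no text in #%s after %s seconds" % (element_id, timeout))
        time.sleep(poll)


if __name__ == "__main__":
    driver = make_headless_firefox()
    try:
        print(scrape_text(driver,
                          "http://board.net/p/ThisIsBob%27sBoard/timeslider",
                          "padcontent"))
    finally:
        driver.quit()  # always shut the browser down
```

Polling by hand keeps the sketch short; Selenium's own `WebDriverWait` is probably the better choice in a real script.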

Comments

I don't know JavaScript myself, and PhantomJS says it has a JavaScript API. Is PhantomJS usable by somebody who doesn't know JavaScript?
@user3149905 as far as I understand your task, you will only write Python code to get the data you need from the page. An example of a page you need to scrape would help me help you :)
@alexce: It looked to me like with PhantomJS I had to inspect a page for JS objects and then query them, or something like that, but I didn't study the API in depth. Here's a specific example page I just created that I'd like to be able to scrape: board.net/p/ThisIsBob%27sBoard/timeslider I'd specifically like the date off that page.
@alexce: So if I understand it correctly, yes I'm only writing python code, but I am introspecting the JS code so I need to know how JS is structured and how it works to make sense of the result. I don't :)
@user3149905 you just need to locate the necessary element by its id (or XPath, or name, or whatever; see the docs). See the update to the answer.