Extracting data from JavaScript (Python Scraper)

Question

I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.

JavaScript:

(function(){DOM.appendContent(this, HTML("<html>"));;})

I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.

Any thoughts?

If it contained a ", would that need to be escaped?

– Jens, Commented Jan 28, 2011 at 7:32

Yes, it would, which adds to the complexity.

– skeggse, Commented Mar 9, 2011 at 18:42

edanfalls · Accepted Answer · 2011-01-28 09:31:43Z

2

Why regex? Since you know how many characters you want to trim off the beginning and end, can't you just slice the string?

string[42:-7]

As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.
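As a minimal sketch of the slicing idea, assuming the wrapper around the HTML is always exactly the text shown in the question: the offsets 42 and 7 are simply the lengths of the fixed prefix and suffix, so they can be derived rather than hard-coded. The names PREFIX, SUFFIX and extract_html_literal are made up for this illustration.

```python
# Fixed text surrounding the HTML, copied from the question's JavaScript.
# Assumption: the site always emits exactly this wrapper.
PREFIX = '(function(){DOM.appendContent(this, HTML("'
SUFFIX = '"));;})'


def extract_html_literal(script):
    """Return the raw (still escaped) text between HTML(" and "))."""
    script = script.strip()
    # Fail loudly if the wrapper ever changes instead of slicing garbage.
    if not (script.startswith(PREFIX) and script.endswith(SUFFIX)):
        raise ValueError("script does not match the expected wrapper")
    return script[len(PREFIX):len(script) - len(SUFFIX)]
```

The startswith/endswith check is an addition on top of the plain slice; it only guards against the wrapper changing on the site.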

edited Jan 28, 2011 at 9:31

answered Jan 28, 2011 at 9:17

edanfalls

530 reputation, 1 gold badge, 8 silver badges, 15 bronze badges

1 Comment

skeggse Over a year ago

That's actually the best way to do this. I'm not sure why it didn't occur to me initially, but then the challenge is to parse that content properly (unescaping and the like).
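One possible way to handle the unescaping, sketched under the assumption that the extracted text only uses escapes that JSON also accepts (such as \", \\, \/, \n and \uXXXX): wrap it in double quotes again and let the json module decode it. JavaScript-only escapes such as \x41 or \' would be rejected, so this is not a general JavaScript string parser. The function name unescape_js_string is hypothetical.

```python
import json


def unescape_js_string(raw):
    """Decode the body of a double-quoted JavaScript string literal.

    Assumption: raw contains only JSON-compatible escape sequences.
    json.loads raises ValueError on anything else.
    """
    return json.loads('"' + raw + '"')
```

For example, passing the sliced text `<div class=\"a\">hi<\/div>` through this function would yield markup with plain quotes and slashes, ready to hand to pyquery.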

Jens · Accepted Answer · 2011-01-28 07:38:55Z

1

If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use

HTML\("((?:\\"|.)*?)"\)

to get the parameter to HTML into the first capturing group.

Note that this regex is not yet escaped for use inside a JavaScript string literal itself.
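Since the scraper in the question is written in Python, here is a rough sketch of applying the same pattern with the re module. A raw string avoids any extra escaping of the pattern, and re.DOTALL is an assumption added here so that . also matches newlines. HTML_ARG and find_html_argument are illustrative names.

```python
import re

# The pattern from the answer, written as a raw string so the backslashes
# reach the regex engine unchanged. DOTALL lets "." match newlines too.
HTML_ARG = re.compile(r'HTML\("((?:\\"|.)*?)"\)', re.DOTALL)


def find_html_argument(script):
    """Return the still-escaped argument of HTML("..."), or None."""
    match = HTML_ARG.search(script)
    return match.group(1) if match else None
```

The captured group still contains the \" escapes, so it needs the same unescaping step as the sliced version.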

answered Jan 28, 2011 at 7:38

Jens

25.7k reputation, 9 gold badges, 80 silver badges, 120 bronze badges
