0

I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.

JavaScript:

(function(){DOM.appendContent(this, HTML("<html>"));;})

I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.

Any thoughts?

2
  • If it contained a ", would that need to be escaped? Commented Jan 28, 2011 at 7:32
  • Yes, it would, which adds to the complexity. Commented Mar 9, 2011 at 18:42

2 Answers

2

Why regex? Can't you just slice the string, since you know how many characters you want to trim off the beginning and the end?

string[42:-7]

As well as being quicker than a regex, it then doesn't matter if quotes inside <html> are escaped or not.
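
To avoid hard-coding 42 and 7, the offsets can be derived from the wrapper text itself. This is only a sketch: it assumes every script has exactly the wrapper shown in the question, and PREFIX, SUFFIX, and extract_html_literal are names made up for illustration.

```python
# Assumed fixed wrapper around the HTML string, copied from the question.
PREFIX = '(function(){DOM.appendContent(this, HTML("'
SUFFIX = '"));;})'


def extract_html_literal(script):
    """Return the raw (still escaped) text between PREFIX and SUFFIX."""
    if not (script.startswith(PREFIX) and script.endswith(SUFFIX)):
        # Fail loudly instead of returning a wrongly trimmed string.
        raise ValueError("script does not have the expected wrapper")
    return script[len(PREFIX):len(script) - len(SUFFIX)]
```

For the wrapper in the question, len(PREFIX) is 42 and len(SUFFIX) is 7, so this is the same slice as string[42:-7], with a check that the wrapper has not changed. The result is still the escaped JavaScript string body, not the final HTML.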

1 Comment

That's actually the best way to do this. I'm not sure why it didn't occur to me initially, but the challenge then is to parse that content properly (unescaping and the like).
1

If every occurrence of " inside the HTML is escaped as \" (it is a JavaScript string, after all), you could use

HTML\("((?:\\"|.)*?)"\)

to get the parameter to HTML into the first capturing group.

Note that this regex has not itself been escaped for embedding in a JavaScript string literal.
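
Since the question is about scraping from Python, here is a rough sketch of how this could look there. It uses a slightly different pattern from the one above (it skips over any backslash escape, not just \"), and it assumes the string body only uses escapes that are also valid JSON (\", \\, \/, \n, \uXXXX and so on), so that json.loads can do the unescaping. JavaScript-only escapes such as \' or \x41 would make json.loads raise. HTML_CALL and extract_and_unescape are illustrative names.

```python
import json
import re

# Match HTML("...") and capture the string body. Inside the quotes, accept
# either a character that is neither a quote nor a backslash, or a backslash
# followed by any single character (an escape sequence).
HTML_CALL = re.compile(r'HTML\("((?:[^"\\]|\\.)*)"\)', re.DOTALL)


def extract_and_unescape(script):
    """Return the unescaped HTML passed to HTML(...), or None if not found."""
    match = HTML_CALL.search(script)
    if match is None:
        return None
    # Re-wrap the body in quotes and let the JSON parser resolve the escapes.
    return json.loads('"' + match.group(1) + '"')
```

A raw string literal (r'...') keeps the backslashes in the pattern readable, which sidesteps the escaping issue mentioned above.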
