Extract JavaScript from an HTML document using a regular expression

Question

I am trying to extract the JavaScript from google.com using a regular expression.

Program

import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
print scriptlis

Output:

['']

Can anyone tell me how to extract JavaScript from an HTML document using regular expressions only?

@rid Yes, I am asking why it's not working.

– Balakrishnan, Commented Aug 7, 2013 at 16:15

score 5 · Accepted Answer · 2013-08-07 16:36:46Z

This works:

import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes "." match newline characters as well. That was the root of your problem: the scripts on google.com span multiple lines, so (.*?) cannot match them unless you tell the regex to let "." match newlines.

The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which makes the match case-insensitive, so the pattern also matches tags written in upper or mixed case. This isn't strictly necessary here because Google's markup is fairly clean. But if you had markup with tags like <SCRIPT>...</SCRIPT>, you would need this flag.
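
To see the two flags in isolation, here is a minimal sketch that runs against a small hard-coded HTML string, so it needs no network access. The sample markup and the variable names are made up for illustration (this is not Google's actual page), and the code assumes Python 3.

```python
import re

# Made-up sample: one multi-line script and one upper-case tag pair.
html = '<html><script>\nvar a = 1;\n</script><SCRIPT>var b = 2;</SCRIPT></html>'

# No flags: "." stops at newlines and the match is case-sensitive.
no_flags = re.findall(r'<script>(.*?)</script>', html)

# "s" (re.DOTALL) only: the multi-line script now matches.
dotall_only = re.findall(r'(?s)<script>(.*?)</script>', html)

# "s" and "i" together: the upper-case tag pair matches as well.
both = re.findall(r'(?si)<script>(.*?)</script>', html)

print(no_flags)     # []
print(dotall_only)  # ['\nvar a = 1;\n']
print(both)         # ['\nvar a = 1;\n', 'var b = 2;']
```

Writing the flags inline as (?si) or passing re.S | re.I as the flags argument should behave the same way; the inline form just keeps everything in the pattern string.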

edited Aug 7, 2013 at 16:36

answered Aug 7, 2013 at 16:25

user2555451


5 Comments

woofmeow Over a year ago

+1 for working .. could you elaborate on the (?si) part more

Balakrishnan Over a year ago

I don't know which one is correct, but when I compare your answer with @Antti Haapala's answer, the results are not equal: re.findall('(?si)<script>(.*?)</script>', gdoc) == re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S) gives Output: False

user2555451 Over a year ago

@woofmeow - There ya go.

woofmeow Over a year ago

@iCodez i googled it in the mean time ... but thanks .. learnt something (new lol) today :)

user2555451 Over a year ago

That's because his pattern captures the attributes too. I didn't think you wanted that, based on the sample code you gave. If you do, this should work: '(?si)<script.*?>(.*?)</script>'.

Burhan Khalid · Answer · 2013-08-07 16:34:55Z

1

If you don't mind third-party libraries, requests combined with BeautifulSoup works well:

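import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.google.com')
# Name the parser explicitly; omitting it makes bs4 pick one and emit a warning.
p = bs(r.content, 'html.parser')
# find_all returns every <script> tag in the document, with or without attributes.
print(p.find_all('script'))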
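
If installing third-party packages is not an option, Python's standard library also ships an HTML parser that can do a similar job. The sketch below is one possible way to collect script tags with it, not a drop-in replacement for BeautifulSoup. The class name ScriptCollector and the sample HTML are made up for illustration, and the code assumes Python 3.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects (attributes, inline code) for every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished (attrs_dict, code) pairs
        self._attrs = None   # attributes of the script being read, if any
        self._chunks = []    # text pieces seen inside the current script

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <SCRIPT> arrives as "script".
        if tag == 'script':
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in more than one piece, so accumulate it.
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._attrs is not None:
            self.scripts.append((self._attrs, ''.join(self._chunks)))
            self._attrs = None


# Made-up sample: one inline script and one external script.
sample = ('<html><SCRIPT type="text/javascript">\nalert("foo");\n</SCRIPT>'
          '<script src="foo.js"></script></html>')
collector = ScriptCollector()
collector.feed(sample)
collector.close()
print(collector.scripts)
```

Each entry pairs the tag's attributes with its inline code, so an external script shows up with a src attribute and an empty code string.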

answered Aug 7, 2013 at 16:34

Burhan Khalid

1 Comment

Balakrishnan Over a year ago

I am having some issues with BeautifulSoup, which is why I want to use only regular expressions.

PepperoniPizza · Answer · 2013-08-07 16:22:32Z

0

I think the problem is that the text between <script> and </script> spans several lines, so you could try something like this:

rg = re.compile('<script>(.*)</script>', re.DOTALL)
result = re.findall(rg, gdoc)

answered Aug 7, 2013 at 16:22

PepperoniPizza


1 Comment

user2555451 Over a year ago

You want it non-greedy: re.compile('<script>(.*?)</script>', re.DOTALL)
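
The difference between the greedy and non-greedy forms only shows up when a page has more than one script tag. Here is a minimal sketch of that case, using a made-up HTML string and illustrative variable names, and assuming Python 3:

```python
import re

# Made-up sample with two separate inline scripts.
html = '<script>var a = 1;</script><p>text</p><script>var b = 2;</script>'

# Greedy: (.*) runs to the LAST </script>, swallowing everything in between.
greedy = re.findall(r'<script>(.*)</script>', html, re.DOTALL)

# Non-greedy: (.*?) stops at the FIRST </script> after each opening tag.
lazy = re.findall(r'<script>(.*?)</script>', html, re.DOTALL)

print(greedy)  # ['var a = 1;</script><p>text</p><script>var b = 2;']
print(lazy)    # ['var a = 1;', 'var b = 2;']
```

With re.DOTALL in effect the greedy version is free to cross newlines too, so on a real page it would tend to return one large match from the first opening tag to the last closing tag.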

Antti Haapala · Answer · 2013-08-07 16:23:49Z

0

What you could probably try is:

scriptlis = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S)

Because most script tags are of type:

<script language="javascript" src="foo"></script>

or

<script language="javascript">alert("foo")</script>

and some even are <SCRIPT></SCRIPT>

None of these match your regex. My regex grabs the attributes in group 1 and any inline code in group 2. It also matches script tags that sit inside HTML comments. But it is about the best you can do without BeautifulSoup et al.
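
As a rough illustration of the two groups, the sketch below runs the pattern from this answer, unchanged, against the three tag forms listed above plus one script inside an HTML comment. The sample string and variable names are made up for illustration, and the code assumes Python 3.

```python
import re

# Made-up sample: the three tag forms from the answer, plus a commented-out script.
html = (
    '<script language="javascript" src="foo"></script>\n'
    '<script language="javascript">alert("foo")</script>\n'
    '<SCRIPT></SCRIPT>\n'
    '<!-- <script>alert("commented out")</script> -->'
)

# Group 1 captures the attributes, group 2 the inline code.
pattern = r'<script\s*([^>]*)\s*>(.*?)</script'
matches = re.findall(pattern, html, re.I | re.S)

for attributes, code in matches:
    print(repr(attributes), repr(code))
```

The last tuple comes from the commented-out tag, which shows the limitation mentioned above: the regex has no notion of HTML comments, so those scripts are returned like any other.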

edited Aug 7, 2013 at 16:23

answered Aug 7, 2013 at 16:17

Antti Haapala


4 Comments

user2555451 Over a year ago

You mean re.findall(r'<script\b\s*([^>]*)>(.*?)</script>', gdoc).

Balakrishnan Over a year ago

Not working: scriptlis = re.findall(r'<script\b\s*([^>]*)>(.*?)</script>', gdoc) gives Output: [('', '')]

Balakrishnan Over a year ago

I don't know which one is correct, but when I compare your answer with @iCodez's answer, the results are not equal: re.findall('(?si)<script>(.*?)</script>', gdoc) == re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S) gives Output: False

Antti Haapala Over a year ago

Of course not, because my regex fetches the attributes too, and not only inline scripts.
