2

I am trying to extract JavaScript from google.com using a regular expression.

Program

import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
print scriptlis

Output:

['']

Can anyone tell me how to extract JavaScript from an HTML document using only regular expressions?

1
  • @rid Yes, I am asking why it's not working. Commented Aug 7, 2013 at 16:15

4 Answers

5

This works:

import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes the dot (.) match newline characters as well. That was the root of your problem: the scripts on google.com span multiple lines, so (.*?) cannot match them unless you tell the regex engine to let the dot cross newlines.

The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which makes the match case-insensitive, so the tag is found however it is capitalized. This isn't strictly necessary here because Google's markup uses lowercase tags consistently. But if you have a page that uses something like <SCRIPT>...</SCRIPT>, you will need this flag.
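
As a rough illustration of what the two flags change, here is a minimal sketch. The sample string is a made-up stand-in for the downloaded page (it is not Google's real markup), and print() is written with parentheses so the snippet runs on Python 2 and 3 alike.

```python
import re

# Hypothetical stand-in for the downloaded page; not Google's real markup.
sample = "<html><script>\nvar a = 1;\n</script><SCRIPT>var b = 2;</SCRIPT></html>"

# No flags: "." stops at newlines and the match is case-sensitive,
# so neither script block is found.
no_flags = re.findall(r'<script>(.*?)</script>', sample)

# (?s) lets "." cross newlines; (?i) also matches the uppercase <SCRIPT> tag.
with_flags = re.findall(r'(?si)<script>(.*?)</script>', sample)

# Passing the flags as an argument is equivalent to the inline (?si) form.
with_flag_args = re.findall(r'<script>(.*?)</script>', sample, re.S | re.I)

print(no_flags)
print(with_flags)
```

On this sample the first call finds nothing, while the flagged calls return both script bodies.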


5 Comments

+1 for working. Could you elaborate on the (?si) part?
I don't know which one is correct, but when I compare your answer with @Antti Haapala's answer, the results are not equal: re.findall('(?si)<script>(.*?)</script>', gdoc) == re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S) Output: False
@woofmeow - There ya go.
@iCodez I googled it in the meantime, but thanks. Learnt something new today :)
That's because his gets the attributes too. I didn't think you wanted that, based on the sample code you gave. If you do, this should work: '(?si)<script.*?>(.*?)</script>'.
1

If you don't have an issue with third party libraries, requests combined with BeautifulSoup makes for a great combination:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.google.com')
p = bs(r.content, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
p.find_all('script')

1 Comment

I am having some issues with BeautifulSoup. That's why I want to use only regular expressions.
0

I think the problem is that the text between <script> and </script> spans several lines, so you could try something like this:

rg = re.compile('<script>(.*)</script>', re.DOTALL)
result = re.findall(rg, gdoc)
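
One thing worth checking with re.DOTALL is the quantifier: a greedy (.*) can run from the first <script> to the last </script> on the page. The sketch below compares the two forms on a made-up sample string (an assumption for illustration, not real page content).

```python
import re

# Hypothetical page fragment with two separate script blocks.
sample = "<script>one();</script>\n<p>text</p>\n<script>two();</script>"

# Greedy: with DOTALL, ".*" swallows everything up to the LAST </script>.
greedy = re.findall(r'<script>(.*)</script>', sample, re.DOTALL)

# Non-greedy: ".*?" stops at the FIRST </script> after each opening tag.
lazy = re.findall(r'<script>(.*?)</script>', sample, re.DOTALL)

print(greedy)
print(lazy)
```

Here the greedy pattern returns a single match that contains the markup between the two blocks, while the non-greedy pattern returns each script body separately.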

1 Comment

You want it non-greedy: re.compile('<script>(.*?)</script>', re.DOTALL)
0

What you could probably try is:

scriptlis = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S)

Because most script tags are of type:

<script language="javascript" src="foo"></script>

or

<script language="javascript">alert("foo")</script>

and some are even <SCRIPT></SCRIPT>.

None of these match your regex. My regex would grab the attributes in group 1 and any inline code in group 2. It would also match script tags inside HTML comments, but that is about the best possible without BeautifulSoup et al.
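
To make the two groups concrete, here is a small sketch that runs the pattern above over the tag shapes just listed. The sample string is invented for illustration, including the commented-out tag on its last line.

```python
import re

# Hypothetical sample covering the tag shapes discussed above.
sample = (
    '<script language="javascript" src="foo"></script>\n'
    '<script language="javascript">alert("foo")</script>\n'
    '<SCRIPT></SCRIPT>\n'
    '<!-- <script>hidden()</script> -->'
)

# Group 1: the attribute text; group 2: the inline code (possibly empty).
found = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', sample, re.I | re.S)

for attrs, code in found:
    print(repr(attrs), repr(code))
```

Each result is an (attributes, code) tuple. The last tuple comes from the tag inside the HTML comment, which shows the limitation mentioned above.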

4 Comments

You mean re.findall(r'<script\b\s*([^>]*)>(.*?)</script>', gdoc).
Not working: scriptlis = re.findall(r'<script\b\s*([^>]*)>(.*?)</script>',gdoc) Output: [('', '')]
I don't know which one is correct, but when I compare your answer with @iCodez's answer, the results are not equal: re.findall('(?si)<script>(.*?)</script>', gdoc) == re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S) Output: False
Of course not, because my regex fetches the attributes too, not only the inline scripts.
