
I am scraping data from a particular page. I have this code:

regex = '<ul class="w462">(.*?)</ul>'

opener.open(baseurl)
urllib2.install_opener(opener)

... rest of code omitted ...

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

pattern = re.compile(regex)
movies = re.findall(pattern, htmlText)

# movies always comes back empty, so the first branch always runs.
if not movies:
    print "List is empty. Printing source instead...", "\n\n"
    print htmlText
else:
    print movies

content of htmlText:

<ul class="w462">

... bunch of <li>s (the content I want to retrieve).

</ul>

htmlText contains the correct source (I searched it with Ctrl+F and verified that it contains the desired ul element). It is just that my regex is unable to extract the desired content.

I have tried to use this instead:

movies = re.findall(r'<ul class="w462">(.*?)</ul>', htmlText)

Does anyone know what went wrong?

  • Why aren't you using an HTML parser to parse HTML? Commented Sep 20, 2013 at 4:05
  • Anyway, the data you found with Ctrl+F could have been generated by JavaScript, which I don't think a regex can catch. (Don't quote me, I could be completely wrong.) Consider Selenium. I've never used it, but I think it's the right tool. Commented Sep 20, 2013 at 4:17

1 Answer


By default, . in a regexp matches any character except for a newline. So your regexp can't match anything that spans more than one line (that contains at least one newline).

Change the compilation line to:

pattern = re.compile(regex, re.DOTALL)

This changes the meaning of the dot: with re.DOTALL, . matches any character, including a newline.
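
To see the difference in isolation, here is a minimal sketch. The html_text string is a made-up stand-in for the real page (the actual <li> contents are not shown in the question), and the names without_dotall, with_dotall and items are only for illustration. It is written so that it runs on both Python 2 and Python 3.

```python
import re

# Hypothetical stand-in for htmlText; the real page content is not shown in the question.
html_text = '<ul class="w462">\n<li>First</li>\n<li>Second</li>\n</ul>'

regex = '<ul class="w462">(.*?)</ul>'

# Default behaviour: "." does not match "\n", so a <ul> that spans
# several lines can never be matched and the result is an empty list.
without_dotall = re.findall(regex, html_text)

# With re.DOTALL, "." also matches "\n", so the lazy group can span lines.
with_dotall = re.compile(regex, re.DOTALL).findall(html_text)

# The captured block can then be split into its individual items.
items = re.findall(r'<li>(.*?)</li>', with_dotall[0])

print(without_dotall)
print(items)
```

The pattern itself is unchanged; only the flag differs. Writing (?s) at the start of the pattern should have the same effect as passing re.DOTALL.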


4 Comments

Ya, for fancy parsing you definitely want a real HTML parsing module, but for simple tasks like this regexps are fine. Don't heed the haters - LOL ;-)
Regular expressions are never appropriate in the context of parsing markup languages. NEVER.
Heh - @username55 was stuck, and is now unstuck. This was a Python question: "practicality beats purity" ;-)
@user2618501 Yes, they can be appropriate sometimes. If you're dealing with limited HTML, it's fine. Stop being so pedantic over it :p
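
For comparison, the parser-based approach raised in the comments could look roughly like the sketch below. It uses only the standard library's HTMLParser. The class name ListItemCollector and the sample html_text are hypothetical, and the sketch assumes the target list is not nested inside another <ul> and that class="w462" is the element's only class.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question


class ListItemCollector(HTMLParser):
    """Collect the text of each <li> inside <ul class="w462">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_target_ul = False
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "ul" and ("class", "w462") in attrs:
            self.in_target_ul = True
        elif tag == "li" and self.in_target_ul:
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_target_ul = False
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Accumulate text only while inside a targeted <li>.
        if self.in_li:
            self.items[-1] += data


# Hypothetical stand-in for htmlText; the real page content is not shown.
html_text = '<ul class="w462">\n<li>First</li>\n<li>Second</li>\n</ul>'

collector = ListItemCollector()
collector.feed(html_text)
collector.close()
print(collector.items)
```

This is longer than the one-line regex, but it does not depend on where the newlines fall in the markup.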
