2

I'm trying to parse some returned HTML (from http://www.google.com/movies?near=37130) to look for currently playing movies. The pattern I'm trying to match looks like:
<span dir=ltr>Clash of the Titans</span>

There are several of these in the returned HTML.

I'm trying to get an array of the movie titles with the following command:
titles = listings_html.split(/(<span dir=ltr>).*(<\/span>)/)

But I'm not getting the results I'm expecting. Can anyone see a problem with my approach or regex?

4
  • Please see stackoverflow.com/questions/1732348/… Commented Apr 3, 2010 at 15:34
  • Also, this question might just be the worst formatted question ever! Commented Apr 3, 2010 at 15:35
  • The thing is, someone always bitches if I don't post every single little comment in the code. So I was just trying to avoid that. Commented Apr 3, 2010 at 16:13
  • How about posting the URL to the page you want to parse and cleaning out the pasted HTML, por favor? Commented Apr 4, 2010 at 0:56

4 Answers

5

It is generally considered a Very Bad idea to parse HTML with regexes, since HTML does not have a regular grammar. See the list of links to explanations (some from SO) here.

You should instead use a dedicated HTML parsing library, such as this


4

I didn't read the whole code you posted since it burned my eyes.

<span>.*</span>

This regex matches <span>hello</span> correctly, but on <span>hello</span><span>there</span> it matches the whole string instead of each span separately. Remember that the * operator is greedy, so it matches the longest string possible. Making it non-greedy with .*? should make it work.
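A minimal sketch of the difference, using the two-span string from above (the variable names are only for illustration):

```ruby
html = "<span>hello</span><span>there</span>"

# Greedy: .* runs on to the last </span>, so the whole string is one match.
greedy = html.scan(/<span>(.*)<\/span>/).flatten

# Non-greedy: .*? stops at the first </span> it can, giving one match per tag.
lazy = html.scan(/<span>(.*?)<\/span>/).flatten

p greedy
p lazy
```

The greedy version returns a single capture that still contains markup, while the non-greedy version returns the two pieces of text separately.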

However, it's not wise to use regular expressions to parse HTML code.

1- You can't always parse HTML with regex. HTML is not regular.

2- It's very hard to write or maintain regex.

3- It's easy to break the regex by using an input like <span><a href="</span>"></a></span>.
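To illustrate point 3, here is a small sketch that runs the non-greedy pattern against that input (the variable names are made up for the example):

```ruby
tricky = '<span><a href="</span>"></a></span>'

# The regex knows nothing about attribute quoting, so it treats the
# "</span>" inside the href value as the closing tag.
captured = tricky.scan(/<span>(.*?)<\/span>/).flatten

p captured
```

The capture ends in the middle of the attribute value, which is not the content of the span at all.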


3

To parse HTML with Ruby, use Nokogiri or hpricot.

3 Comments

I'd definitely use hpricot, it's really easy to use. There is good documentation in the readme here github.com/whymirror/hpricot
And I'd definitely use Nokogiri because it was able to handle malformed XML that hpricot puked on. :-) nokogiri.org
@Jamie, of the two, I'd recommend Nokogiri, too.

2

(It doesn't appear that the sample HTML you posted actually has any examples of the pattern you're trying to match.)

Alicia is correct that regex against HTML is generally a bad idea, and as your requirements become more complex it will break down.

That said, your example is pretty simple.

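# match is an array of capture groups; here it holds just the title
doc.scan(/<span dir=ltr>(.*?)<\/span>/) do |match|
  puts match
end

As mentioned, .* is greedy, and that holds inside scan as well. I was able to match several of these patterns in a single document with a greedy (.*) too, but most likely only because . does not match newlines by default, so as long as each span sits on its own line a match cannot run past the end of that line. Using .*? keeps the matches separate even when two spans share a line.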
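To tie this back to the split attempt in the question, here is a rough sketch comparing split and scan. The sample markup and the second title are invented for the example; the real page will look different.

```ruby
# Hypothetical sample markup; the second title is made up.
listings_html = <<~HTML
  <div><span dir=ltr>Clash of the Titans</span></div>
  <div><span dir=ltr>Made-Up Second Title</span></div>
HTML

# split treats the pattern as a separator: the text it matches is thrown
# away (only the capture groups are kept), so the titles disappear.
pieces = listings_html.split(/(<span dir=ltr>).*(<\/span>)/)

# scan returns what the capture group matched, which is the title itself.
titles = listings_html.scan(/<span dir=ltr>(.*?)<\/span>/).flatten

p pieces
p titles
```

With split, the array holds the surrounding markup and the captured tags but none of the titles, which would explain the unexpected results; scan with a capture group around the title is the closer fit.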
