ruby regex, parsing html

Question

I'm trying to parse some returned HTML (from http://www.google.com/movies?near=37130) to look for currently playing movies. The pattern I'm trying to match looks like:
<span dir=ltr>Clash of the Titans</span>

Of which there are several in the returned html.

I'm trying to get an array of the movie titles with the following command:
titles = listings_html.split(/(<span dir=ltr>).*(<\/span>)/)

But I'm not getting the results I'm expecting. Can anyone see a problem with my approach or regex?

Also, this question might just be the worst formatted question ever!
– Jørn Schou-Rode, Commented Apr 3, 2010 at 15:35
The thing is, someone always bitches if I don't post every single little comment in the code. So I was just trying to avoid that.
– danwoods, Commented Apr 3, 2010 at 16:13
How about posting the URL to the page you want to parse and cleaning out the pasted HTML, por favor?
– the Tin Man, Commented Apr 4, 2010 at 0:56

Community · Accepted Answer · 2017-05-23 12:13:38Z

5

It is generally considered Very Bad to parse HTML with regexes, since HTML is not a regular language. See the list of links to explanations (some from SO) here.

You should instead use a dedicated HTML library, such as this

edited May 23, 2017 at 12:13

CommunityBot

answered Apr 3, 2010 at 15:32

Alice

tiftik · Accepted Answer · 2010-04-03 15:48:44Z

4

I didn't read the whole code you posted since it burned my eyes.

<span>.*</span>

This regex matches <span>hello</span> correctly, but fails at <span>hello</span><span>there</span> and matches the whole string. Remember that the * operator is greedy, so it will match the longest string possible. You can make it non-greedy by using .*?, which should make it work.
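
A minimal sketch of the difference, using a made-up two-span string (the variable names are only for illustration):

```ruby
# Sample input with two adjacent spans (made up for illustration).
html = '<span>hello</span><span>there</span>'

# Greedy: .* runs to the last </span>, so the match swallows both spans.
greedy = html[/<span>.*<\/span>/]

# Non-greedy: .*? stops at the first </span> it can reach.
lazy = html[/<span>.*?<\/span>/]

# scan with a capture group returns one array of captures per match.
words = html.scan(/<span>(.*?)<\/span>/).flatten

puts greedy        # the whole input string
puts lazy          # <span>hello</span>
puts words.inspect # ["hello", "there"]
```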

However, it's not wise to use regular expressions to parse HTML code.

1- You can't always parse HTML with regex. HTML is not regular.

2- It's very hard to write or maintain regex.

3- It's easy to break the regex by using an input like <a href=""></a>.
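
To illustrate how fragile this is, here is a small sketch with two hypothetical inputs (neither is taken from the page in the question): a span nested inside another span, and a span that carries an attribute.

```ruby
pattern = /<span>(.*?)<\/span>/

# Hypothetical markup: a span nested inside another span.
nested = '<span>Clash <span>of the</span> Titans</span>'

# The non-greedy pattern stops at the first </span>, which belongs to the
# inner span, so the capture is cut short and still contains a tag.
cut_short = nested[pattern, 1]

# Hypothetical markup: the same kind of element, but with an attribute.
with_attr = '<span class="title">Avatar</span>'

# The literal "<span>" no longer matches, so nothing is found at all.
missed = with_attr[pattern, 1]

puts cut_short.inspect # "Clash <span>of the"
puts missed.inspect    # nil
```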

answered Apr 3, 2010 at 15:48

tiftik

maček · Accepted Answer · 2010-04-03 15:37:37Z

3

To parse HTML with Ruby, use Nokogiri or hpricot.

answered Apr 3, 2010 at 15:37

maček

3 Comments

Jamie Over a year ago

I'd definitely use hpricot, it's really easy to use. There is good documentation in the readme here github.com/whymirror/hpricot

the Tin Man Over a year ago

And I'd definitely use Nokogiri because it was able to handle malformed XML that hpricot puked on. :-) nokogiri.org

maček Over a year ago

@Jamie, of the two, I'd recommend Nokogiri, too.

Mike Cargal · Accepted Answer · 2010-04-03 15:55:03Z

2

(It doesn't appear that the sample HTML you posted actually has any examples of the pattern you're trying to match.)

Alice is correct that regex against HTML is generally a bad idea, and as your requirements become more complex it will break down.

That said, your example is pretty simple:

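# doc holds the HTML as a String; scan yields one array of captures per match
doc.scan(/<span dir=ltr>(.*?)<\/span>/) do |(title)|
  puts title
end

As mentioned, .* is greedy, and scan does not change that. A greedy (.*) appeared to work when I tried it, most likely because . does not match newlines by default, so each match stays on its own line; the non-greedy (.*?) used above is the safer form when several spans share a line. I was able to match several of these patterns in a single document.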
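
A small sketch of how scan behaves with greedy and non-greedy patterns, using made-up listings. One likely reason a greedy pattern can seem to work is that . does not match a newline by default, so the layout of the input matters:

```ruby
# Made-up listings: one title per line, then the same titles on one line.
multi_line  = "<span dir=ltr>Avatar</span>\n<span dir=ltr>Kick-Ass</span>\n"
single_line = "<span dir=ltr>Avatar</span><span dir=ltr>Kick-Ass</span>"

greedy = /<span dir=ltr>(.*)<\/span>/
lazy   = /<span dir=ltr>(.*?)<\/span>/

# . does not match "\n" by default, so greedy .* cannot cross a line break.
per_line = multi_line.scan(greedy).flatten

# On a single line the greedy pattern runs on to the last </span>.
run_on = single_line.scan(greedy).flatten

# The non-greedy pattern gives one capture per span in either layout.
safe = single_line.scan(lazy).flatten

puts per_line.inspect # ["Avatar", "Kick-Ass"]
puts run_on.inspect   # ["Avatar</span><span dir=ltr>Kick-Ass"]
puts safe.inspect     # ["Avatar", "Kick-Ass"]
```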

edited Apr 3, 2010 at 15:55

answered Apr 3, 2010 at 15:49

Mike Cargal
