Extract string from HTML tags using RegExp (Ruby)

Question

I would like to extract "toast" from a string <h1>test</h1><div>toast</div>. What regular expression could isolate such a string?

Edit: Thanks to the user who corrected the formatting.

More info: There will only ever be one div tag in the string. The content inside it may change, but there will never be another div tag in the same string (the real string is larger than the given sample).

Thanks!

Based on what? Do you just want all text within any div? This is probably better done with some sort of DOM parser rather than regex.
– Smern, Commented Aug 7, 2013 at 17:47
@smerny Sorry, I fixed the question. My boss is requiring me to use regex, so I have no choice :/
– John Dough, Commented Aug 7, 2013 at 17:50
Nokogiri is the best tool for parsing HTML and XML.
– Arup Rakshit, Commented Aug 7, 2013 at 17:52
We need more information. Which part of the string is variable? For example, a naive solution could be regex = /<h1>test<\/h1><div>([^<]*)<\/div>/
– James Lim, Commented Aug 7, 2013 at 17:54
Well, this is just a small part of the entire string, so unfortunately no easy solutions work (I tried those, but the regex gets way too clunky). The tags will always remain the same; only the content inside (i.e. "toast") will change.
– John Dough, Commented Aug 7, 2013 at 18:02

Arup Rakshit · Accepted Answer · 2013-08-07 17:48:07Z

6

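You can use Nokogiri, an HTML parser, instead of a regex.

require 'nokogiri'

# css returns every matching node, so this collects the text of all divs.
doc = Nokogiri::HTML::Document.parse("<div> test </div> <div> toast </div>")
doc.css('div').map(&:text)
# => [" test ", " toast "]

For the string in the question, at_css returns only the first matching node:

doc = Nokogiri::HTML::Document.parse("<h1>test</h1><div>toast</div>")
doc.at_css('div').text
# => "toast"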

answered Aug 7, 2013 at 17:48

2 Comments

John Dough Over a year ago

Sorry, I fixed the question. This shouldn't be that complicated, right?

Andy Lester Over a year ago

Using an HTML parser is not complicated. Dealing with changes in your data that you don't expect, but are still perfectly valid HTML, is what is complicated. A little time spent up front with a proper HTML parser will save you hours of debugging and heartache down the road.
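To make the point about unexpected but valid HTML concrete, here is a small sketch. The pattern and the helper name `div_text` are invented for illustration and are not taken from any answer in this thread; the inputs are variations that a browser would treat the same way as the original sample.

```ruby
# Hypothetical helper: a simple tag-matching regex, assumed for illustration.
SIMPLE_DIV = /<div>(.*?)<\/div>/

def div_text(html)
  # String#[] with a capture index returns the group, or nil when nothing matches.
  html[SIMPLE_DIV, 1]
end

p div_text("<h1>test</h1><div>toast</div>")               # the sample from the question
p div_text('<h1>test</h1><div class="note">toast</div>')  # an attribute on the tag
p div_text("<h1>test</h1><DIV>toast</DIV>")               # uppercase tag names
p div_text("<h1>test</h1><div>\ntoast\n</div>")           # content on its own line
```

Only the first call returns "toast"; the other three return nil, because the pattern expects the literal text `<div>`, is case-sensitive, and `.` does not match a newline unless the `m` flag is set. Each case can be patched individually, but that patching is the maintenance cost being warned about here.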

Smern · Accepted Answer · 2013-08-07 18:07:41Z

1

Regex is not the usual tool for this, and for good reason. But if you must use one, and since you said the string will never contain more than a single div, this should work for you:

(?<=<div>).*(?=</div>)

answered Aug 7, 2013 at 18:07

1 Comment

John Dough Over a year ago

This isolates the correct information (toast), but I have one question: if I wanted to return it, what would I have to call on the string? I tried string.split(/(?<=<div>).*(?=<\/div>)/) and string.scan(/(?<=<div>).*(?=<\/div>)/) but neither is correct.
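For readers with the same question, a minimal sketch of what those two calls return and how to get the bare string. The constant names `SAMPLE` and `DIV_CONTENT` are made up for this example; the pattern is the one from the answer with the `/` escaped for a Ruby regex literal.

```ruby
SAMPLE = "<h1>test</h1><div>toast</div>"
DIV_CONTENT = /(?<=<div>).*(?=<\/div>)/

# scan returns an array of all matches, so the string is its first element.
p SAMPLE.scan(DIV_CONTENT)        # => ["toast"]
p SAMPLE.scan(DIV_CONTENT).first  # => "toast"

# String#[] with a regex returns the first match directly, or nil if none.
p SAMPLE[DIV_CONTENT]             # => "toast"

# split cuts the string at the match, so it returns everything except "toast".
p SAMPLE.split(DIV_CONTENT)       # => ["<h1>test</h1><div>", "</div>"]
```

So `scan` was close: it finds the text but wraps it in an array. `SAMPLE[DIV_CONTENT]` is probably the shortest way to get a plain string, and it returns nil rather than raising when the div is missing.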

James Lim · Accepted Answer · 2013-08-07 17:57:00Z

1

We need more information. If the string is exactly "<h1>test</h1><div>toast</div>", then something naïve like

regex = /<h1>test<\/h1><div>([^<]*)<\/div>/
found = "<h1>test</h1><div>toast</div>".match(regex)[1]
# => "toast"

would work. My best guess at this point is that you are expecting input of the form

<h1>*</h1><div>*</div>

in which case you can use this:

regex = /<h1>[^<]*<\/h1><div>([^<]*)<\/div>/
found = "<h1>any string can go here</h1><div>toast</div>".match(regex)[1]
# => "toast"

Note that this breaks if there are any nested elements in either tag. A more robust solution is to use Nokogiri. Talk to your boss.
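A short sketch of that failure mode, assuming the second regex above. The method name `extract_div` is hypothetical. It uses `String#[]` with a capture index instead of `match(...)[1]`, so a failed match yields nil rather than a NoMethodError on nil.

```ruby
H1_DIV = /<h1>[^<]*<\/h1><div>([^<]*)<\/div>/

# Hypothetical wrapper: returns the div text, or nil when the pattern does not match.
def extract_div(html)
  html[H1_DIV, 1]
end

p extract_div("<h1>any string can go here</h1><div>toast</div>")  # => "toast"

# A nested element inside either tag stops [^<]* at the inner "<", so the match fails.
p extract_div("<h1>a <b>bold</b> title</h1><div>toast</div>")      # => nil
p extract_div("<h1>title</h1><div>to<em>a</em>st</div>")           # => nil
```

Checking for nil at the call site at least makes the breakage visible instead of crashing or silently returning the wrong text.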

answered Aug 7, 2013 at 17:57
