1

I would like to extract "toast" from a string <h1>test</h1><div>toast</div>. What regular expression could isolate such a string?

Edit: Thanks to the user who corrected the formatting.

More Info: There will always be exactly one div tag. The content inside it may change, but there will never be another div tag in the same string (the real string is larger than the given sample).

Thanks!

8
  • based on what? do you just want all text within any div? this is probably best to do with some sort of dom parser rather than regex. Commented Aug 7, 2013 at 17:47
  • @smerny sorry, I fixed the question. My boss is requiring me to use regex, so I have no choice :/ Commented Aug 7, 2013 at 17:50
  • Nokogiri is the best tool for parsing HTML and XML. Commented Aug 7, 2013 at 17:52
  • We need more information. Which part of the string is variable? For example, a naive solution could be regex = /<h1>test<\/h1><div>([^<]*)<\/div>/ Commented Aug 7, 2013 at 17:54
  • Well, this is just a small part of the entire string so no easy solutions work unfortunately (I tried those, but the regex is way too clunky). All the tags will always remain the same, it's the content inside (i.e. "toast") that will change Commented Aug 7, 2013 at 18:02

3 Answers

6

You can use Nokogiri.

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse("<div> test </div> <div> toast </div>")
doc.css('div').map(&:text)
# => [" test ", " toast "]

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse("<h1>test</h1><div>toast</div>")
doc.at_css('div').text
# => "toast"

2 Comments

Sorry, I fixed the question. This shouldn't be that complicated, right?
Using an HTML parser is not complicated. Dealing with changes in your data that you don't expect, but are still perfectly valid HTML, is what is complicated. A little time spent up front with a proper HTML parser will save you hours of debugging and heartache down the road.
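To make the point about valid-but-unexpected HTML concrete, here is a rough sketch in plain Ruby. The sample strings and the `strict` pattern are my own illustrations, not taken from the asker's real data; each sample is still valid HTML, yet only the first one matches.

```ruby
# A strict pattern that assumes the div looks exactly like the sample.
strict = /<div>([^<]*)<\/div>/

# Illustrative inputs (assumptions, not the asker's real strings).
samples = [
  '<h1>test</h1><div>toast</div>',            # the shape from the question
  '<h1>test</h1><div class="x">toast</div>',  # an attribute was added
  '<h1>test</h1><DIV>toast</DIV>',            # tag names in uppercase
  '<h1>test</h1><div>toast</div >'            # whitespace inside the end tag
]

# String#[] with a regex and a capture index returns the capture or nil.
results = samples.map { |html| html[strict, 1] }

p results
# => ["toast", nil, nil, nil]
```

A parser treats all four inputs the same way, which is the debugging time the comment above is talking about.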
1

This is not something that is typically done with regex, and for good reason. But if you must, and since you said there will never be more than a single div in the string, this should work for you:

(?<=<div>).*(?=</div>)

1 Comment

This isolates the correct information (toast), but I have one question: if I wanted to return it, what would I have to call on the string? I tried string.split(/(?<=<div>).*(?=<\/div>)/) and string.scan(/(?<=<div>).*(?=<\/div>)/), but neither returns what I want.
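For the question in the comment above, a minimal sketch of how the match could be pulled out in Ruby. The variable names are illustrative, and the lazy `.*?` plus the `m` flag are my own adjustments (assumptions, not part of the original answer) so the pattern stops at the first closing tag and tolerates newlines inside the div.

```ruby
string = "<h1>test</h1><div>toast</div>"

# Lookbehind/lookahead keep the tags out of the match itself.
pattern = /(?<=<div>).*?(?=<\/div>)/m

# String#[] with a regex returns the first match as a String (nil if none).
inner = string[pattern]

# String#scan returns an Array of all matches, so take the first element.
first_from_scan = string.scan(pattern).first

# String#split removes the match and returns the pieces around it,
# which is why it does not hand back the div content.
pieces = string.split(pattern)

puts inner            # toast
puts first_from_scan  # toast
p pieces              # ["<h1>test</h1><div>", "</div>"]
```

So `scan` was close: it returns `["toast"]`, an array, and `.first` (or `string[pattern]`) gives the bare string.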
1

We need more information. If the string is exactly "<h1>test</h1><div>toast</div>", then something naïve like

regex = /<h1>test<\/h1><div>([^<]*)<\/div>/
found = "<h1>test</h1><div>toast</div>".match(regex)[1]
# => "toast"

would work. If, as I am guessing at this point, the string follows the pattern

<h1>*</h1><div>*</div>

then use this:

regex = /<h1>[^<]*<\/h1><div>([^<]*)<\/div>/
found = "<h1>any string can go here</h1><div>toast</div>".match(regex)[1]
# => "toast"

Note that this breaks if there are any nested elements in either tag. A more robust solution is to use Nokogiri. Talk to your boss.
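As a rough illustration of the nested-element caveat, the sketch below runs the same pattern against a made-up input with a `<b>` inside the div. The `div_text` helper and the constant name are hypothetical. It uses `String#[]` with a capture index so that a failed match yields `nil` rather than the `NoMethodError` you would get from calling `[1]` on a `nil` match result.

```ruby
# Same pattern as above, stored in a constant (name is illustrative).
DIV_TEXT = /<h1>[^<]*<\/h1><div>([^<]*)<\/div>/

# Hypothetical helper: returns the div text, or nil when the pattern fails.
def div_text(html)
  html[DIV_TEXT, 1]
end

plain  = div_text("<h1>any string can go here</h1><div>toast</div>")
nested = div_text("<h1>title</h1><div><b>toast</b></div>")

p plain   # "toast"
p nested  # nil, because [^<]* cannot cross the nested <b> tag
```

Checking for `nil` at the call site at least makes the failure visible instead of crashing or silently returning the wrong text.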
