0

I have

tmp_body_symbols="things <st>hello</st> and <st>blue</st> by <st>orange</st>"
str1_markerstring = "<st>"
str2_markerstring = "</st>"
frags << tmp_body_symbols[/#{str1_markerstring}(.*?)#{str2_markerstring}/m, 1]

frags ends up containing only "hello", but I want ["hello", "blue", "orange"].

How would I do that?

2
  • Isn't an XML parser recommended for such a problem? Commented Jan 28, 2015 at 5:29
  • yeah, was thinking about using nokogiri but this is all that we're really capturing and seems like overkill. Commented Jan 28, 2015 at 5:32

3 Answers

3

Use scan:

tmp_body_symbols.scan(/#{str1_markerstring}(.*?)#{str2_markerstring}/m).flatten

See also: Ruby docs for String#scan.
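If the marker strings could ever contain regex metacharacters (for example "[st]" instead of "<st>"), interpolating them directly would change the meaning of the pattern. A minimal sketch of one way to guard against that, assuming the markers are meant to be matched literally; the helper name fragments_between is hypothetical and not part of the original answer:

```ruby
# Hypothetical helper (not from the answer above): collects every fragment
# found between two literal marker strings.
def fragments_between(text, open_marker, close_marker)
  # Regexp.escape keeps characters such as "[" or "." in a marker from being
  # interpreted as regex syntax.
  pattern = /#{Regexp.escape(open_marker)}(.*?)#{Regexp.escape(close_marker)}/m
  # With one capture group, scan returns an array of one-element arrays,
  # so flatten turns it into a flat array of strings.
  text.scan(pattern).flatten
end

tmp_body_symbols = "things <st>hello</st> and <st>blue</st> by <st>orange</st>"
frags = fragments_between(tmp_body_symbols, "<st>", "</st>")
p frags
```

For plain markers like "<st>" and "</st>" the escaping changes nothing, so this should behave the same as the one-liner above.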


3 Comments

thx Doorknob, I updated the question from using different variable names but you got it right anyway
Do you need the multiline modifier (/m)?
@CarySwoveland If the tags can potentially contain multiline data (ex. <st>foo\nbar\nbaz</st>), then yes. I've just kept the regex the same as in the original question, to avoid confusion.
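To illustrate the point about /m from the comment above, here is a small sketch using a made-up input string (an assumption, not data from the question) in which one tag spans two lines. In Ruby, "." does not match a newline unless the /m modifier is set:

```ruby
# Made-up sample input: the first tag contains a newline, the second does not.
text = "<st>foo\nbar</st> and <st>baz</st>"

# Without /m, "." stops at the newline, so the first tag pair is skipped.
without_m = text.scan(/<st>(.*?)<\/st>/).flatten

# With /m, "." also matches newlines, so both fragments are captured.
with_m = text.scan(/<st>(.*?)<\/st>/m).flatten

p without_m
p with_m
```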
2

You can use Nokogiri to parse HTML/XML:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse("things <st>hello</st> and <st>blue</st> by <st>orange</st>")
doc.css('st').map(&:text)
#=> ["hello", "blue", "orange"]

More info: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html

3 Comments

thx, makes sense, might do this down the road but more quick and dirty for this one off
It was just one line but when you need to get data from whole page/file I think this is better to use.
i agree - when it gets there; nokogiri would probably be a great soln
0

You can do this with a capture group, as @Doorknob has done, or without a capture group, by using a ("zero-width") positive look-behind and positive-lookahead:

tmp = "things <st>hello</st> and <st>blue</st> by <st>orange</st>"
s1 = "<st>"
s2 = "</st>"

tmp.scan(/(?<=#{ s1 }).*?(?=#{ s2 })/)
  #=> ["hello", "blue", "orange"]
  • (?<=#{ s1 }), which evaluates to (?<=<st>), is the positive look-behind.
  • (?=#{ s2 }), which evaluates to (?=</st>), is the positive lookahead.
  • ? following .* makes it "non-greedy". Without it:

tmp.scan(/(?<=#{ s1 }).*(?=#{ s2 })/)
  #=> ["hello</st> and <st>blue</st> by <st>orange"] 
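One practical difference between the two approaches is the shape of what scan returns: with a capture group it yields an array of arrays, while with lookarounds (no groups) it yields a flat array of strings. A short side-by-side sketch, reusing the variables from this answer:

```ruby
tmp = "things <st>hello</st> and <st>blue</st> by <st>orange</st>"
s1 = "<st>"
s2 = "</st>"

# Capture group: each match is wrapped in its own one-element array.
grouped = tmp.scan(/#{s1}(.*?)#{s2}/)

# Lookarounds: the markers are asserted but not consumed, and there is no
# group, so scan returns the matched strings directly.
lookaround = tmp.scan(/(?<=#{s1}).*?(?=#{s2})/)

p grouped
p grouped.flatten
p lookaround
```

So calling flatten matters only for the capture-group form; on the lookaround result it is a no-op.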
