
I want to make an array of results from a string like this one, using a regular expression:

results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday

Here’s my regex as it stands. It works in Sublime Text’s regex search but not in Ruby:

(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))

and this would be the desired result:

1. results|foofoofoo
2. results|barbarbarbar
3. results|googoogoo

Instead I’m getting these weird returns, and I can’t understand it. Why does this not select the result lines?

Match 1
1. results
2. results|
3. results|
4.  

Match 2
1. results
2. results|
3. results|
4.   

Match 3
1. results
2. timestamps||
3.  
4. timestamps||

Here’s the actual code using the regex:

#create new lines for each regex'd line body with that body set as the raw attribute
host_scan.raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end
5 Comments
  • First: What does the code look like? Second: Why do you have a \\n (\n) in there? There is no \n in your string. (Neither is there a newline.) Commented May 25, 2012 at 18:13
  • Why use a regex when the more obvious choice is to use split("\n") and then use array slices or individual array indexes? Commented May 25, 2012 at 19:07
  • @theTinMan There are some complications. As far as \n is concerned, for example, there are lots more \n instances in each "result". I feel I'm very close to doing this the "right" way with this regex, it's just not returning the same way it does in Sublime Text. Commented May 25, 2012 at 19:14
  • If your example is too simple we can't help you very well. It has to be accurate enough to give us a feel for the problem. Commented May 25, 2012 at 19:19
  • @theTinMan Sorry, I've provided the relevant code above. Commented May 25, 2012 at 19:24

4 Answers

1

As Kendall Frey already stated, you are creating too many capture groups. There is no need to group the leading literal “results|”, and no need to wrap each alternative of the alternation in its own non-capturing group. The regex you are aiming for is this:

/results\|.*?(?=\\n(?:results\||timestamps\|\|))/

or, if you don’t mind repeating the \\n part, you can do away with the non-capturing subgroup:

/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/

Both return an array of the matched values, as specified in your question.
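As a quick check, here is a minimal sketch that runs the first pattern against the sample string from the question. It assumes the \n sequences in the raw data are the two literal characters backslash and n (which is what \\n in the regex matches), so the sample is written with single quotes. The variable names raw, pattern and matches are only for illustration.

```ruby
# Sample input from the question. Single quotes keep each \n as the two
# literal characters backslash + n, which is what \\n in the regex matches.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# The pattern has no capture groups, so String#scan returns the whole
# matches as plain strings.
pattern = /results\|.*?(?=\\n(?:results\||timestamps\|\|))/

matches = raw.scan(pattern)
matches.each { |m| puts m }
```

Because the delimiter sits inside the lookahead, it is not consumed and does not appear in the matched text.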


1 Comment

Also note that this won’t match a “results|foobargoo” string at the end of your line. If you need that one too, the regex is /results\|.*?(?=\\n(?:results\||timestamps\|\|)|$)/ (or /results\|.*?(?=\\nresults\||\\ntimestamps\|\||$)/ if you go with the second variant).
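A small sketch of that difference, assuming a hypothetical input that ends with a results entry and has no trailing timestamps section (again with literal backslash-n separators):

```ruby
# Hypothetical input whose last entry is not followed by another delimiter.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo'

# Without the |$ alternative the final entry has nothing to look ahead to.
strict  = raw.scan(/results\|.*?(?=\\n(?:results\||timestamps\|\|))/)

# With |$ the lookahead also succeeds at the end of the line.
relaxed = raw.scan(/results\|.*?(?=\\n(?:results\||timestamps\|\|)|$)/)

puts strict.length
puts relaxed.length
```

In Ruby, $ matches at the end of every line, not only at the end of the string. If the data can contain real newline characters, \z may be the safer anchor.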
0

I'm guessing it has something to do with capturing groups. If you change all your (...) to (?:...) it will eliminate capturing groups.
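To make the effect concrete, here is a minimal sketch using the sample string from the question (written with single quotes on the assumption that the separators are literal backslash-n pairs). When a pattern contains capture groups, String#scan returns the captured text for each match rather than the whole match. The names with_group and without_group are only illustrative.

```ruby
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# One capture group, inside the lookahead: scan returns only what that
# group captured, wrapped in a one-element array per match.
with_group = raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/)

# No capture groups at all: scan returns each whole match as a string.
without_group = raw.scan(/(?:results)\|.*?\\n(?=(?:results\||timestamps\|\|))/)

p with_group
p without_group
```

The first call yields only the tail delimiters, and the second yields the full result lines (each still ending in the literal backslash-n that the pattern consumes).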

6 Comments

ok that got me a little closer. (?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|))) returns three matches, although they are only the endings of each group (results| results| and timestamps||)
Thanks. When I put that one in, though, I get no matches at all.
I'm not sure, but I'm starting to think I understand what (?:...) does. It seems to be returning the last appearance of it as the match, because all my matches are results| or timestamps||. Is this correct?
No, it just specifies a non-capturing group.
code posted. Thanks for putting up with my inexperience here, I'm still learning the etiquette.
0

Rather than jump to a regex, which is a much more complicated way to get at the data, use split("\n").

text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"
ary = text.split("\n")

ary is:

[
  "results|foofoofoo",
  "results|barbarbarbar",
  "results|googoogoo",
  "timestamps||friday"
]

Slice that and you can get:

ary[0..2]
=> ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]

EDIT:

Based on the comment that the strings contain more newlines and other complex characters:

require 'awesome_print'

text = "results|foofoofoo\nmorefoo\nandevenmorefoo\nresults|barbarbarbar\nandmorebar\nandyetagainmorebar\nresults|googoogoo\ntimestamps||friday"
ap text.sub(/\|\|friday$/, '').split('results')[1..-1].map{ |l| 'results' + l }

Which outputs:

[
  [0] "results|foofoofoo\nmorefoo\nandevenmorefoo\n",
  [1] "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n",
  [2] "results|googoogoo\ntimestamps"
]

4 Comments

I couldn't put the actual string in here for security reasons (they're vulnerability scans), but split("\n") won't work because the portions I'm trying to select contain \n, not to mention a lot of other stuff. Sorry that I didn't specify that. Your solution would be an elegant one if not for that.
Something more complex than what's in the added sample above?
Yes, I have paragraphs of unpredictable text, sometimes containing the words results, |, or \n. This method of dividing it appears to be the only way that will work, I just have to figure out how to get all the text back instead of those strange partial returns.
By providing some reasonable facsimiles of the data you are working on we can do a lot better job of providing answers. You can cleanse the data of anything that would be sensitive easily enough, just provide samples of the sort of "unpredictable text" you encounter.
0

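The answer turned out to lie in the parentheses. When a pattern contains capture groups, String#scan returns the captured text instead of the whole match, so wrapping the entire expression in a single capturing group made it return the full match instead of just the tail delimiter.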

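# With one capture group, scan yields a one-element array per match,
# so body here is an array holding the matched text.
host_scan.raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end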
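For reference, a minimal standalone sketch of what that scan call returns, using the sample string from the question in place of host_scan.raw (the separators are assumed to be literal backslash-n pairs, and Line is left out because its definition is not shown). The flatten and sub steps are optional extras, not part of the original solution.

```ruby
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'
pattern = /((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/

# One capture group, so each element is a one-element array.
rows = raw.scan(pattern)

# Flatten to plain strings if an array per match is not wanted.
bodies = rows.flatten

# Optionally strip the trailing literal backslash-n that the pattern consumes.
trimmed = bodies.map { |b| b.sub(/\\n\z/, '') }

p rows
p trimmed
```

Dropping the outer capturing group altogether would make scan return the strings directly, which avoids the flatten step.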
