Regex returning weird arrays

Question

I want to make an array of results from a string like this one, using a regular expression:

results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday

Here’s my regex as it stands. It works in Sublime Text’s regex search but not in Ruby:

(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))

and this would be the desired result:

1. results|foofoofoo
2. results|barbarbarbar
3. results|googoogoo

Instead I’m getting these weird returns, and I can’t understand it. Why does this not select the result lines?

Match 1
1. results
2. results|
3. results|
4.  

Match 2
1. results
2. results|
3. results|
4.   

Match 3
1. results
2. timestamps||
3.  
4. timestamps||

Here’s the actual code using the regex:

#create new lines for each regex'd line body with that body set as the raw attribute
host_scan.raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end

First: What does the code look like? Second: Why do you have a \\n (\n) in there? There is no \n in your string. (Neither is there a newline.)
– Kendall Frey, Commented May 25, 2012 at 18:13
Why use a regex when the more obvious choice is to use split("\n") and then use array slices or individual array indexes?
– the Tin Man, Commented May 25, 2012 at 19:07
@theTinMan There are some complications. As far as \n is concerned, for example, there are lots more \n instances in each "result". I feel I'm very close to doing this the "right" way with this regex, it's just not returning the same way it does in Sublime Text.
– blaha, Commented May 25, 2012 at 19:14
If your example is too simple we can't help you very well. It has to be accurate enough to give us a feel for the problem.
– the Tin Man, Commented May 25, 2012 at 19:19

Community · Accepted Answer · 2017-05-23 12:11:28Z

1

As Kendall Frey already stated, you are creating too many capture groups. There is no need to group the first literal “results|”, and no need to wrap each alternative of your alternation in its own non-capturing group. What you are aiming for is this regex:

/results\|.*?(?=\\n(?:results\||timestamps\|\|))/

or, if you don’t mind repeating the \\n part, you can do away with the non-capturing subgroup:

/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/

Both will return an array of matched values as specified in your question.
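
A quick way to check both patterns is to run them through String#scan against the sample string from the question. This is only a minimal sketch: it assumes the separators in the real data are the literal two-character sequence backslash plus n (which is what \\n in the regex matches), so the sample is written as a single-quoted Ruby string. The variable names are made up for illustration.

```ruby
# Sample data from the question. Single quotes keep "\n" as two literal
# characters (backslash + n), which is what \\n in the patterns matches.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# Variant 1: one shared \\n, alternation inside a non-capturing group.
grouped = /results\|.*?(?=\\n(?:results\||timestamps\|\|))/

# Variant 2: \\n repeated in each alternative, no inner group.
repeated = /results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/

# With no capture groups, scan returns a flat array of matched strings.
grouped_matches  = raw.scan(grouped)
repeated_matches = raw.scan(repeated)

p grouped_matches
# => ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
p grouped_matches == repeated_matches
# => true
```

Because the \\n sits inside the lookahead in both variants, the separator is checked but not included in each match.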

edited May 23, 2017 at 12:11


answered May 25, 2012 at 23:16

kopischke


1 Comment

kopischke Over a year ago

Also note that this won’t match a “results|foobargoo” string at the end of your line. If you need that one too, the regex is /results\|.*?(?=\\n(?:results\||timestamps\|\|)|$)/ (or /results\|.*?(?=\\nresults\||\\ntimestamps\|\||$)/ if you go with the second variant).
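
The difference the added $ makes can be seen with a small sketch. The input here is hypothetical (a string whose last record is not followed by another delimiter), and it again assumes literal backslash-n separators:

```ruby
# Hypothetical input: the final record has no delimiter after it.
raw = 'results|foofoofoo\nresults|foobargoo'

without_end = /results\|.*?(?=\\n(?:results\||timestamps\|\|))/
with_end    = /results\|.*?(?=\\n(?:results\||timestamps\|\|)|$)/

missed = raw.scan(without_end)
caught = raw.scan(with_end)

p missed
# => ["results|foofoofoo"]
p caught
# => ["results|foofoofoo", "results|foobargoo"]
```

Note that in Ruby $ matches at the end of every line, not only at the end of the string. If the data may contain real newline characters, \z (end of string) is the stricter anchor; which one is appropriate depends on the actual input.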

Kendall Frey · Accepted Answer · 2012-05-25 18:17:13Z

0

I'm guessing it has something to do with capturing groups. If you change all your (...) to (?:...) it will eliminate capturing groups.
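
The effect can be reproduced directly: when a pattern contains capture groups, String#scan returns one array of captures per match rather than the matched text. A minimal sketch, assuming the sample string from the question with literal backslash-n separators (variable names are illustrative):

```ruby
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# Original pattern: four capture groups, so scan yields four captures per match.
capturing = /(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))/
captures = raw.scan(capturing)
p captures.first
# => ["results", "results|", "results|", nil]

# Same pattern with every (...) turned into (?:...): scan yields whole matches.
non_capturing = /(?:results)\|.*?\\n(?=(?:(?:results\|)|(?:timestamps\|\|)))/
whole = raw.scan(non_capturing)
p whole.first
# => "results|foofoofoo\\n"
```

The captures line up with the "Match 1" to "Match 3" listing in the question; a group that did not take part in a match comes back as nil.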

answered May 25, 2012 at 18:17

Kendall Frey


6 Comments

blaha Over a year ago

ok that got me a little closer. (?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|))) returns three matches, although they are only the endings of each group (results| results| and timestamps||)

blaha Over a year ago

Thanks. When I put that one in, though, I get no matches at all.

blaha Over a year ago

I'm not sure, but I'm starting to think I understand what (?:...) does. It seems to be returning the last appearance of it as the match, because all my matches are results| or timestamps||. Is this correct?

Kendall Frey Over a year ago

No, it just specifies a non-capturing group.

blaha Over a year ago

code posted. Thanks for putting up with my inexperience here, I'm still learning the etiquette.


the Tin Man · Accepted Answer · 2012-05-26 08:14:28Z

0

Rather than jump to a regex, which is a much more complicated way to get at the data, use split("\n").

text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"
ary = text.split("\n")

ary is:

[
  "results|foofoofoo",
  "results|barbarbarbar",
  "results|googoogoo",
  "timestamps||friday"
]

Slice that and you can get:

ary[0..2]
=> ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]

EDIT:

Based on the comment that there are more newlines and complex characters in the strings:

require 'awesome_print'

text = "results|foofoofoo\nmorefoo\nandevenmorefoo\nresults|barbarbarbar\nandmorebar\nandyetagainmorebar\nresults|googoogoo\ntimestamps||friday"
ap text.sub(/\|\|friday$/, '').split('results')[1..-1].map{ |l| 'results' << l }

Which outputs:

[
  [0] "results|foofoofoo\nmorefoo\nandevenmorefoo\n",
  [1] "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n",
  [2] "results|googoogoo\ntimestamps"
]
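
The output above still carries a trailing newline on the first two entries and a leftover "\ntimestamps" on the last one. One possible way to tidy that up is to cut the tail first and then split on a zero-width lookahead. This is a swapped-in alternative, not the code from this answer, and it rests on two assumptions: every record starts with results| at the beginning of a line, and the timestamps|| line marks the end of the results. The string uses real newlines, as in the sample above.

```ruby
text = "results|foofoofoo\nmorefoo\nandevenmorefoo\n" \
       "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n" \
       "results|googoogoo\ntimestamps||friday"

# Drop everything from the (assumed) terminating timestamps|| line onward.
body = text.sub(/\ntimestamps\|\|.*\z/m, '')

# Split before each line that starts with "results|". The lookahead is
# zero-width, so the "results|" prefix stays attached to its record.
records = body.split(/^(?=results\|)/).map(&:chomp)

p records
# => ["results|foofoofoo\nmorefoo\nandevenmorefoo",
#     "results|barbarbarbar\nandmorebar\nandyetagainmorebar",
#     "results|googoogoo"]
```

Anchoring on ^ means a stray "results" in the middle of a line does not trigger a split, though a body line that itself begins with results| would still be treated as a new record.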

edited May 26, 2012 at 8:14

answered May 25, 2012 at 19:14

the Tin Man


4 Comments

blaha Over a year ago

I couldn't put the actual string in here for security reasons (they're vulnerability scans), but split("\n") won't work because the portions I'm trying to select contain \n, not to mention a lot of other stuff. Sorry that I didn't specify that. Your solution would be an elegant one if not for that.

the Tin Man Over a year ago

Something more complex than what's in the added sample above?

blaha Over a year ago

Yes, I have paragraphs of unpredictable text, sometimes containing the words results, |, or \n. This method of dividing it appears to be the only way that will work, I just have to figure out how to get all the text back instead of those strange partial returns.

the Tin Man Over a year ago

By providing some reasonable facsimiles of the data you are working on we can do a lot better job of providing answers. You can cleanse the data of anything that would be sensitive easily enough, just provide samples of the sort of "unpredictable text" you encounter.

blaha · Accepted Answer · 2012-05-29 13:46:38Z

0

The answer turned out to lie in the parentheses. Wrapping the whole pattern in a capturing group made scan return the entire match instead of just the tail delimiter captured inside the lookahead.

host_scan.raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/).each do |body|
  @lines << Line.new({:raw => body})
end
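
One thing to be aware of with this approach: with a single capture group, scan still returns an array of one-element arrays, so body in the loop above is an Array rather than a String. A minimal sketch of the difference, assuming the sample string from the question (literal backslash-n separators) and leaving out the Line class, which is not shown in the post:

```ruby
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# Pattern from this answer: one capture group around the whole expression.
wrapped = /((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/

nested = raw.scan(wrapped)
p nested.first
# => ["results|foofoofoo\\n"]

# Either unwrap the single capture from each match ...
flat = nested.flatten

# ... or drop the outer parentheses so scan returns plain strings.
unwrapped = /(?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|))/
plain = raw.scan(unwrapped)

p flat == plain
# => true
```

Whether the nested form matters depends on what Line.new does with :raw; if it expects a String, one of the two flat forms is probably the safer choice.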

answered May 29, 2012 at 13:46

blaha

