How can I improve this small Ruby Regex snippet?

Question

How can I improve this?

The purpose of this code is to be used in a method that captures a string of hashtags (#twittertype) from a form, parses through the list of words, and makes sure all the words are separated out.

WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats    mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
SECOND_TEST = 'orion#Orion#oRion,Mike'

This is my problem area, the regexps:

_string_rgx = /([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/

add_pound_sign = lambda { |a| a[0].chr == '#' ? a : a='#' + a; a}

I don't know regular expressions that well, hence the need to collect the first element from each result of the scan. It yielded weird stuff, but the first element was always what I wanted.

 t_word = WORD_TEST.scan(_string_rgx).collect {|i| i[0] }
 s_word = SECOND_TEST.scan(_string_rgx).collect {|i| i[0] }
 t_word.map! { |a| a = add_pound_sign.call(a); a }
 s_word.map! { |a| a = add_pound_sign.call(a); a }

The results are what I want. I just want insight from the Ruby and regex gurus out there.

puts t_word.inspect

[ 
"#123", "#sunset", "#2d2-apple", "#home", "#star", "#Babyclub", 
"#apple_surprise", "#apple", "#cats", "#mustard", "#dog", 
"#basic_cable", "#safety", "#222", "#dog-D", "#DOG", "#2D"
]

puts s_word.inspect

[
"#orion", "#Orion", "#oRion", "#Mike"
]

Thanks in advance.

amon · Accepted Answer · 2012-07-31 19:35:21Z

2

Let's unfold the regex:

(
   [a-zA-Z0-9]+ (-|_)? \w+
   | #? [a-zA-Z0-9]+ (-|_)? \w+
)

( begin capture group

[a-zA-Z0-9]+ match one or more alphanumeric characters

(-|_)? match a hyphen or an underscore and capture it. This group is optional and may match nothing

\w+ match one or more "word" characters (alphanumeric + underscore)

| OR match this:

#? match optional # character

[a-zA-Z0-9]+ match one or more alphanumeric characters

(-|_)? match a hyphen or an underscore and capture it. This group is optional and may match nothing

\w+ match one or more word characters

) end capture group
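
The groups marked as capturing in this breakdown are likely why scan handed back nested arrays in the question: when a pattern contains capture groups, String#scan returns one array of group values per match instead of the matched string. A minimal sketch, using a single-branch form of the pattern and a made-up sample string (the constant names are illustrative, not from the question):

```ruby
# Illustrative names; neither constant appears in the original question.
WITH_GROUPS    = /(#?[a-zA-Z0-9]+(-|_)?\w+)/   # two capture groups
WITHOUT_GROUPS = /#?[a-zA-Z0-9]+(?:-|_)?\w+/   # (?:...) groups without capturing

sample = "sunset #2d2-apple #apple_surprise"

# With capture groups, scan yields [group 1, group 2] for every match.
grouped = sample.scan(WITH_GROUPS)
p grouped  # [["sunset", nil], ["#2d2-apple", "-"], ["#apple_surprise", "_"]]

# Without capture groups, scan yields the matched strings directly,
# so the extra collect { |i| i[0] } step is no longer needed.
flat = sample.scan(WITHOUT_GROUPS)
p flat     # ["sunset", "#2d2-apple", "#apple_surprise"]
```

Switching the inner group to (?:...) and dropping the outer parentheses keeps the matched text the same while removing the need to pick element 0 from each result.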

I'd rather write this regex like this:

(#? [a-zA-Z0-9]+ (-|_)? \w+)

or

( #? [a-zA-Z0-9]+ (-?\w+)? )

or

( #? [a-zA-Z0-9]+ -? \w+ )

(all are reasonably equivalent)
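
The patterns above are spaced out for readability. To use that layout verbatim in Ruby, the regex needs the x (extended) flag, and under that flag a bare # starts a comment, so the pound sign has to be escaped. A sketch under those assumptions (HASHTAG and COMPACT are illustrative names, and a non-capturing group is swapped in for the capturing one):

```ruby
WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats    mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "

# Extended-mode version of the first rewrite; whitespace and comments are ignored.
HASHTAG = /
  \#?            # optional pound sign, escaped because x mode treats a bare one as a comment
  [a-zA-Z0-9]+   # one or more ASCII letters or digits
  (?:-|_)?       # optional hyphen or underscore, not captured
  \w+            # one or more word characters
/x

# The same pattern written compactly, without the x flag.
COMPACT = /#?[a-zA-Z0-9]+(?:-|_)?\w+/

tags = WORD_TEST.scan(HASHTAG)
p tags == WORD_TEST.scan(COMPACT)  # true
p tags.first(3)                    # ["123", "sunset", "#2d2-apple"]
```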

You should note that this regex will fail on hashtags with Unicode characters, e.g. #Ü-Umlaut, #façade, etc. You are also limited to a two-character minimum length (#a fails, #ab matches) and may have only one hyphen (#a-b-c fails, or rather would return #a-b).
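
These limits can be checked directly, and one possible way around them is sketched below. UNICODE_TAG is an assumption of this example rather than something proposed in the thread: it relies on Ruby's POSIX bracket [[:alnum:]] matching non-ASCII letters in UTF-8 strings, and it allows any number of hyphen-separated parts. It also accepts a leading underscore, which the original pattern does not, so treat it as a starting point only.

```ruby
# Both constant names are illustrative; UNICODE_TAG is an assumed alternative,
# not a pattern proposed anywhere in this thread.
ASCII_TAG   = /#?[a-zA-Z0-9]+(?:-|_)?\w+/
UNICODE_TAG = /#?[[:alnum:]_]+(?:-[[:alnum:]_]+)*/

# Limits of the ASCII pattern described above.
p "#a".scan(ASCII_TAG)      # []        (needs at least two characters)
p "#ab".scan(ASCII_TAG)     # ["#ab"]
p "#a-b-c".scan(ASCII_TAG)  # ["#a-b"]  (only one hyphen is consumed)

# \u escapes keep this snippet ASCII-only: \u00DC is U-umlaut, \u00E7 is c-cedilla.
input = "#a #a-b-c #\u00DC-Umlaut #fa\u00E7ade"
tags = input.scan(UNICODE_TAG)
p tags.length  # 4, one match per hashtag in the input
```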

edited Jul 31, 2012 at 19:35

answered Jul 31, 2012 at 19:23

amon


1 Comment

Orion Engleton Over a year ago

Thank you for breaking this down for me.

GDP · Accepted Answer · 2012-07-31 20:30:57Z

0

I would reduce your regex pattern to something like this:

WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats    mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
foo = []
WORD_TEST.scan(/#?[-\w]+\b/) do |s|
    foo.push( s[0] != '#' ? '#' + s : s )
end
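
The same approach can be wrapped in a small method so it is reusable for both test strings. This is a sketch only: normalize_hashtags is a hypothetical name, and start_with? replaces the s[0] comparison so the check does not depend on how the Ruby version indexes strings.

```ruby
# Hypothetical helper name; not part of the original answer.
# Extracts every word or hashtag and makes sure each one starts with '#'.
def normalize_hashtags(text)
  text.scan(/#?[-\w]+\b/).map do |tag|
    tag.start_with?('#') ? tag : '#' + tag
  end
end

SECOND_TEST = 'orion#Orion#oRion,Mike'
result = normalize_hashtags(SECOND_TEST)
p result  # ["#orion", "#Orion", "#oRion", "#Mike"]
```

Because the pattern has no capture groups, scan returns plain strings and no element picking is needed.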

edited Jul 31, 2012 at 20:30

GDP


answered Jul 31, 2012 at 19:49

godspeedlee


2 Comments

Orion Engleton Over a year ago

This is pretty slick, though I wish I understood how the regex breaks down, especially the [-\w]+\b part.

godspeedlee Over a year ago

In [-\w], the '-' is treated as a literal hyphen because it is the first character in the character set ([...]). \b is the word-boundary metacharacter. :)
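
A short illustration of both points, using a made-up input string (the text variable and its contents are assumptions for this example):

```ruby
text = "dog- cat#2d2-apple"

# A hyphen placed first in the class is a literal character,
# so hyphenated words stay together (and a trailing hyphen is kept).
without_boundary = text.scan(/[-\w]+/)
p without_boundary  # ["dog-", "cat", "2d2-apple"]

# \b requires the match to end at a word boundary, so the regex backtracks
# past the dangling hyphen in "dog-" and keeps only "dog".
with_boundary = text.scan(/[-\w]+\b/)
p with_boundary     # ["dog", "cat", "2d2-apple"]

# Between two characters, a hyphen denotes a range instead: [a-c] is a, b or c.
range_only = "b-d".scan(/[a-c]+/)
p range_only        # ["b"]
```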
