How can I improve this?

The purpose of this code is to be used in a method that captures a string of hashtags (#twittertype) from a form, parses the list of words, and makes sure all the words are separated out.

WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats    mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
SECOND_TEST = 'orion#Orion#oRion,Mike'

This is my problem area, the regexps:

_string_rgx = /([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/

add_pound_sign = lambda { |a| a[0].chr == '#' ? a : a='#' + a; a}

I don't know much about regular expressions, hence the need to collect the first element from each result of the scan. It yielded weird stuff, but the first element was always what I wanted.

 t_word = WORD_TEST.scan(_string_rgx).collect {|i| i[0] }
 s_word = SECOND_TEST.scan(_string_rgx).collect {|i| i[0] }
 t_word.map! { |a| a = add_pound_sign.call(a); a }
 s_word.map! { |a| a = add_pound_sign.call(a); a }

The results are what I want. I just want insight from the Ruby and regex gurus out there.

puts t_word.inspect

[ 
"#123", "#sunset", "#2d2-apple", "#home", "#star", "#Babyclub", 
"#apple_surprise", "#apple", "#cats", "#mustard", "#dog", 
"#basic_cable", "#safety", "#222", "#dog-D", "#DOG", "#2D"
]

puts s_word.inspect

[
"#orion", "#Orion", "#oRion", "#Mike"
]

Thanks in advance.

2 Answers

Let's unfold the regex:

(
   [a-zA-Z0-9]+ (-|_)? \w+
   | #? [a-zA-Z0-9]+ (-|_)? \w+
)

( begin capture group

[a-zA-Z0-9]+ match one or more alphanumeric characters

(-|_)? match a hyphen or an underscore and capture it. This group is optional, so it may match nothing

\w+ match one or more "word" characters (alphanumeric + underscore)

| OR match this:

#? match optional # character

[a-zA-Z0-9]+ match one or more alphanumeric characters

(-|_)? match a hyphen or an underscore and capture it. Optional, so it may match nothing

\w+ match one or more word characters

) end capture group
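The capture groups in this breakdown are also why `scan` returned "weird stuff" in the question: when a pattern contains groups, `String#scan` returns one array per match with one slot per group, and a group that did not take part in the match is `nil`. A minimal sketch, assuming the pattern from the question; the constant names are made up for illustration:

```ruby
# The pattern from the question: one outer group plus two inner (-|_) groups.
CAPTURING = /([a-zA-Z0-9]+(-|_)?\w+|#?[a-zA-Z0-9]+(-|_)?\w+)/

# Same pattern with the outer group dropped and the inner groups
# made non-capturing via (?:...).
NON_CAPTURING = /[a-zA-Z0-9]+(?:-|_)?\w+|#?[a-zA-Z0-9]+(?:-|_)?\w+/

sample = 'orion#Orion#oRion,Mike'

# With capture groups, scan returns one array per match (one slot per group).
with_groups = sample.scan(CAPTURING)
# => [["orion", nil, nil], ["#Orion", nil, nil], ["#oRion", nil, nil], ["Mike", nil, nil]]

# Without capture groups, scan returns the matched strings directly.
without_groups = sample.scan(NON_CAPTURING)
# => ["orion", "#Orion", "#oRion", "Mike"]
```

With the non-capturing form, the extra `collect {|i| i[0] }` step is no longer needed.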

I'd rather write this regex like this:

(#? [a-zA-Z0-9]+ (-|_)? \w+)

or

( #? [a-zA-Z0-9]+ (-?\w+)? )

or

( #? [a-zA-Z0-9]+ -? \w+ )

(All are reasonably equivalent, though the second also matches single-character tags because its trailing group is optional.)
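The variants above are spaced out for readability. To keep that spacing in real Ruby code you would need the `x` (extended) flag, and under that flag an unescaped `#` starts a comment, so the pound sign has to be written `\#`. A rough sketch of the first variant in that style; the constant name is hypothetical and the inner group is made non-capturing so that `scan` returns plain strings:

```ruby
HASHTAG = /
  \#?            # optional leading pound sign (escaped, since a bare one starts a comment)
  [a-zA-Z0-9]+   # one or more alphanumeric characters
  (?:-|_)?       # optional single hyphen or underscore, not captured
  \w+            # one or more word characters
/x

tags = 'orion#Orion#oRion,Mike'.scan(HASHTAG)
# => ["orion", "#Orion", "#oRion", "Mike"]
```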

You should note that this regex will fail on hashtags with Unicode characters, e.g. #Ü-Umlaut, #façade, etc. You are also limited to a two-character minimum length (#a fails, #ab matches) and may have only one hyphen (#a-b-c fails, or rather would return #a-b).
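These limits are easy to check directly. The sketch below uses the first simplified variant with a non-capturing group. `UNICODE_PATTERN` is not from the answer: it is one possible Unicode-aware alternative, built on Ruby's `\p{Alnum}` property, that also allows repeated hyphen or underscore separators. Treat it as an assumption to adapt, not a definitive hashtag grammar.

```ruby
PATTERN = /#?[a-zA-Z0-9]+(?:-|_)?\w+/

short_tag   = '#a'.scan(PATTERN)            # => []
two_chars   = '#ab'.scan(PATTERN)           # => ["#ab"]
two_hyphens = '#a-b-c'.scan(PATTERN)        # => ["#a-b"]
# \u00E7 is the c-cedilla in "facade"; \w is ASCII-only in Ruby, so the tag is split.
accented    = "#fa\u00E7ade".scan(PATTERN)  # => ["#fa", "ade"]

# Assumed alternative (not from the answer): Unicode letters and digits,
# with any number of -segment or _segment parts after the first one.
UNICODE_PATTERN = /#?\p{Alnum}+(?:[-_]\p{Alnum}+)*/

unicode_short   = '#a'.scan(UNICODE_PATTERN)            # => ["#a"]
unicode_hyphens = '#a-b-c'.scan(UNICODE_PATTERN)        # => ["#a-b-c"]
unicode_accent  = "#fa\u00E7ade".scan(UNICODE_PATTERN)  # one match: the whole tag
```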

1 Comment

Thank you for breaking this down for me.

I would reduce your regex pattern to something like this:

WORD_TEST = "123 sunset #2d2-apple,#home,#star #Babyclub, #apple_surprise #apple,cats    mustard#dog , #basic_cable safety #222 #dog-D#DOG#2D "
foo = []
WORD_TEST.scan(/#?[-\w]+\b/) do |s|
    foo.push( s[0] != '#' ? '#' + s : s )
end
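The same idea can be sketched as a small method that returns the array instead of pushing into `foo`. The method name `normalize_hashtags` is made up for illustration; the regex is the one from the snippet above, and `start_with?` replaces the `s[0]` comparison.

```ruby
# Hypothetical helper: extract words and hashtags, then make sure
# every entry starts with a pound sign.
def normalize_hashtags(text)
  text.scan(/#?[-\w]+\b/).map { |s| s.start_with?('#') ? s : '#' + s }
end

second = normalize_hashtags('orion#Orion#oRion,Mike')
# => ["#orion", "#Orion", "#oRion", "#Mike"]

mixed = normalize_hashtags('123 sunset #2d2-apple,#home mustard#dog , #basic_cable')
# => ["#123", "#sunset", "#2d2-apple", "#home", "#mustard", "#dog", "#basic_cable"]
```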

2 Comments

This is pretty slick, though I wish I understood how the regex is broken down, especially the [-\w]+\b.
In [-\w], the '-' is treated as a literal hyphen because it is the first character in the character set ([...]). \b is the word-boundary metacharacter. :)
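To make the explanation above concrete, here is a short sketch of how `[-\w]+` and `\b` behave on a few inputs (the constant name `TAG` is arbitrary). Because the hyphen is just another member of the set, any number of hyphens is accepted, and the trailing `\b` forces the match to end on a word character, so a dangling hyphen is dropped.

```ruby
TAG = /#?[-\w]+\b/

run_together = '#dog-D#DOG#2D'.scan(TAG)  # => ["#dog-D", "#DOG", "#2D"]
many_hyphens = '#a-b-c'.scan(TAG)         # => ["#a-b-c"]
single_char  = '#a'.scan(TAG)             # => ["#a"]
# The match backs off the trailing hyphen, because \b needs a word
# character on one side of the position.
trailing     = 'cats- '.scan(TAG)         # => ["cats"]
# Caveat: a leading hyphen is kept, since [-\w] accepts it anywhere.
leading      = '-foo'.scan(TAG)           # => ["-foo"]
```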
