string parsing optimization : ruby

Question

I am working on a parser that is currently far too slow for my needs (roughly 40x slower than I would like), and I would like advice on how to speed it up. I have tried, and am currently using, a custom regex parser as well as a custom parser built on the StringScanner class (strscan). I've heard a lot of positive comments about Treetop, and I have also considered combining the regexes into one huge regex that would cover all matches, but I would like feedback from people with experience before I rewrite my parser yet again.

The basic rules of the strings that I am parsing are:

3 segments (BoL operators, message, EoL operators)
~6 BoL operators; BoL operators can be in any order
2 EoL operators; EoL operators can be in any order
The quantity of any specific operator can be 0, 1, or >1, but only 1 is used; the rest are removed and discarded
Operators in the 'message' section of the string are not captured or removed
Whitespace is allowed before and after operators but is not required
Some BoL operators can have whitespace in the setting

My current regex parser works by running the string through a loop that checks for BoL or EoL operators one at a time and cuts them out, ending the loop when no operators of the given type remain, like so:

loop {
  # strip leading whitespace before looking for the next operator
  if input =~ /^\s+/ then input.gsub!(/^\s+/, '') end
  if input =~ /regex for operator_a/
    # sets operator_a
    input.gsub!(/regex for operator_a/, '')
  elsif input =~ /regex for operator_b/
    # sets operator_b
    input.gsub!(/regex for operator_b/, '')
  elsif input =~ /regex for operator_c/
    # sets operator_c
    # etc .. etc .. etc..
  else
    break
  end
}

My question: what would be the best way to optimize this code? Treetop, another library/gem that I have not found yet, combining the loops into one huge regex, or something else?
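A minimal sketch of the "one combined regex" option, driven by StringScanner. It assumes every BoL setting is a single word (the rules above allow whitespace in some settings, which this does not handle) and that the first occurrence of a repeated operator is the one kept. BOL_RE, KEYS and parse_bol are hypothetical names for illustration, not part of the existing parser.

```ruby
require 'strscan'

# One alternation covering all four BoL operators, built once.
# "::" is listed before ":" so a target is not mistaken for an adverb.
# Assumption: each setting is a single \w+ word.
BOL_RE = /\s*(?:::(\w+)|:(\w+)|=(\w+)|#(\w+))/
# Capture group N in BOL_RE corresponds to KEYS[N - 1].
KEYS = [:target, :adverb, :verb, :direction]

def parse_bol(input)
  scanner = StringScanner.new(input)
  result = {}
  # Each successful scan consumes exactly one operator from the front.
  while scanner.scan(BOL_RE)
    KEYS.each_with_index do |key, i|
      value = scanner[i + 1]
      # Assumption: keep the first occurrence, discard later duplicates.
      result[key] ||= value if value
    end
  end
  # Whatever is left after the operators is the message.
  result[:message] = scanner.rest.lstrip
  result
end

p parse_bol(":happy::sword=ask hello there")
```

Each operator is matched once and the scanner only moves forward, so the string is never rescanned from the start or rewritten in place. Whether that closes a 40x gap would have to be measured.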

Please restrict all answers and input to the Ruby language. I know that it is not the 'best' tool for this job, but it is the language that I use.

More specific grammar / examples, if that helps. This is for parsing communication commands sent to a game by users; so far the only commands are say and whisper. The beginning-of-line operators accepted are ::{target}, :{adverb}, ={verb}, and #{direction of}. The end-of-line operators are {emoticon} (e.g. :D, :(, :) ), which sets the adverb if not already set, and end-of-line punctuation, which sets the verb if not already set. The character ' is an alias for say, and sayto is an alias for say::. Examples:

':happy::my sword=as# my helm Bol command operators work.

{:action=>:say, :adverb=>"happily", :verb=>"ask", :direction=>"my helm", :message=>"Bol command operators work."}

say yep say works

{:action=>:say, :message=>" yep say works"}

sayto my sword yep sayto works as do EoL operators!:)

{:action=>:say, :target=>"my sword", :adverb=>"happily", :verb=>"say", :message=>"yep sayto works as do EoL operators!"}

whisper::my friend : happy Bol command operators work with whisper.

{:action=>:whisper, :target=>"my friend", :adverb=>"happily", :message=>"Bol command operators work with whisper."}

whisp:happy::tinkerbell and they work in a different order.

{:action=>:whisper, :adverb=>"happily", :target=>"tinkerbell", :message=>"and they work in a different order."}

':bash=exclaim::hammer BoL operators work in this order too.

{:action=>:say, :adverb=>"bashfully", :verb=>"exclaim", :target=>"hammer", :message=>"BoL operators work in this order too."}

sayto bells =say :sad #wontwork Bol > Eol and directed !work with directional? :)

{:action=>:say, :verb=>"say", :adverb=>"sadly", :direction=>"wontwork", :message=>"Bol > Eol and directed !work with directional?"}

'all EoL removed closest to end used and reinserted. !!??!?....... :) ? :(

{:action=>:say, :adverb=>"sadly", :verb=>"ask", :message=>"all EoL removed closest to end used and reinserted?"}

A small improvement might be had by extracting the regexps into variables before the loop: op_a_re = /regex for operator_a/; loop { ... input =~ op_a_re ... }; this way you call the implicit Regexp.new once, instead of once per loop iteration. That said, my admittedly very simple benchmark only sped up by 5% on 1.9.2.
– Amadan, Commented Dec 22, 2011 at 18:13
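A rough sketch of what that suggestion could look like. The patterns here are hypothetical stand-ins for "regex for operator_a" and "regex for operator_b", and strip_operators is an illustrative name. The sketch also relies on sub! returning nil when nothing was replaced, so each pattern is matched once per pass instead of once by =~ and again by gsub!.

```ruby
# Hypothetical patterns, built once outside the loop and anchored
# at the start of the string.
LEADING_WS = /\A\s+/
OP_A_RE    = /\A=(\w+)/   # stand-in for "regex for operator_a"
OP_B_RE    = /\A#(\w+)/   # stand-in for "regex for operator_b"

def strip_operators(input)
  input = input.dup        # leave the caller's string untouched
  found = {}
  loop do
    input.sub!(LEADING_WS, '')
    # sub! returns nil if there was no match, so it doubles as the test.
    if input.sub!(OP_A_RE, '')
      found[:operator_a] ||= $1   # keep the first occurrence only
    elsif input.sub!(OP_B_RE, '')
      found[:operator_b] ||= $1
    else
      break
    end
  end
  [found, input]
end

p strip_operators("  =ask #helm =say hello")
```

Any gain from this is likely to be modest, as the 5% figure above suggests; it is worth benchmarking against the real patterns before relying on it.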
@Amadan Very good point! I had thought of doing that when I got to the code 'clean-up' phase, but hadn't thought about it affecting speed as well. TY
– JosephRuby, Commented Dec 22, 2011 at 18:16
Without knowing much about what you're trying to parse, the question is difficult to answer. It sounds more like a real grammar would be beneficial, but that doesn't necessarily mean it would be faster. Without examples, it's tricky to theorize.
– Dave Newton, Commented Dec 22, 2011 at 21:27
@JosephRuby Yep, and the grammar is still unclear. Grammars are most easily understood when documented generically. No suggestions at this point, although it looks like some splitting and checking for command-ness might be enough, or switch to something like treetop. Can't tell if the grammar is regular-enough to benefit from TT though.
– Dave Newton, Commented Dec 22, 2011 at 21:39
@JosephRuby Still looks like a more-naive split would work, but w/o spending more time on it, not sure. It might even be doable w/ an internal DSL if you make a few concessions, or possibly a thin layer over an internal DSL making a real grammar much easier.
– Dave Newton, Commented Dec 22, 2011 at 22:03

steenslag · Accepted Answer · 2011-12-22 21:57:19Z

3

Maybe this syntax is useful in your case:

emoti_convert = { ":)" => "happily", ":(" => "sadly" }
re_emoti = Regexp.union(emoti_convert.keys)
str = "It does not work :(. Oh, it does :)!"

p str.gsub(re_emoti, emoti_convert)
#=> "It does not work sadly. Oh, it does happily!"

But if you are trying to define a grammar, this is not the way to go (agreeing with @Dave Newton's comments).
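For the EoL rule in the question (the emoticon closest to the end sets the adverb, and the trailing operators are cut off), the same lookup table could be reused without substituting in place. This is only a sketch under assumptions: EOL_RE and parse_eol are hypothetical names, only the two emoticons from the table above are covered, and the punctuation-to-verb step is left out.

```ruby
EMOTI_CONVERT = { ":)" => "happily", ":(" => "sadly" }
RE_EMOTI = Regexp.union(EMOTI_CONVERT.keys)
# A trailing run of emoticons and . ! ? with optional whitespace between them.
EOL_RE = /(?:\s*(?:#{RE_EMOTI}|[.!?]))+\s*\z/

def parse_eol(message)
  tail = message[EOL_RE]
  return [message, nil] unless tail
  # The emoticon closest to the end of the line wins.
  adverb = EMOTI_CONVERT[tail.scan(RE_EMOTI).last]
  [message[0...-tail.length], adverb]
end

p parse_eol("it works!!?? :) ? :(")
#=> ["it works", "sadly"]
```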

edited Dec 22, 2011 at 21:57

answered Dec 22, 2011 at 21:47


1 Comment

JosephRuby Over a year ago

I honestly didn't know that a hash could be used with a regex in gsub like that. Very nice trick! I don't think it will help in this case, though.
