1

I'm trying to match any of the following lines with a regex in Python:

RAA RAA

RAA RAA / OOO OOO

RAA RAA / OOO OOO / ROCKY

These strings should always be on their own line, so "RAA RAA moves over there." wouldn't match.

I came up with this regex using RegExr:

^([A-Z]*([ ]?)*([A-Z]?)*([ \/]?)*)*$

This works fine for matching all the different lines; however, it causes Python to hang when it tries to match "RAA RAA moves over there."

I've no idea why. Are there any regex experts that might have some insight?

4
  • Define "hangs": how long did you wait? Also note that one-character character classes are redundant and that * already implies ? (for instance, ([ ]?)* is just a single space followed by *). Commented Apr 20, 2011 at 16:25
  • 4
    Are you just trying to match lines consisting only of upper case letters, forward slashes and spaces? It's not clear to me what property you are after? Commented Apr 20, 2011 at 16:26
  • 3
    You've said "match any of these (three) lines", and then given us a regexp which matches much more. Please be more specific about the requirements. Commented Apr 20, 2011 at 16:30
  • -1 for again giving us only specific examples and no pattern or description in a question about regex. Commented Apr 20, 2011 at 18:05

2 Answers

2

That regex is far too general: not only does it match more than you want, but it nests so many * quantifiers that the regex matcher keeps pointlessly backtracking to try other ways of splitting up the same characters. I haven't tried to work the combinatorial tree, but it's at least several thousand attempts per non-matching line, and the count grows rapidly with the length of the line.
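To get a feel for how quickly the backtracking grows, here is a rough timing sketch. It does not use the pattern from the question; `NESTED` is a deliberately simplified stand-in (a quantified group whose body is itself quantified), and the helper name `time_failed_match` is made up for this illustration. The inputs are kept short on purpose, since longer ones can take a very long time.

```python
import re
import time

# Simplified stand-in for the question's pattern (an assumption, not the
# original regex): the inner + and the outer + can divide the same run of
# letters between them in many different ways.
NESTED = re.compile(r"^([A-Z]+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n capitals followed by '!'."""
    subject = "A" * n + "!"
    start = time.perf_counter()
    result = NESTED.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing '!' can never be matched
    return elapsed


if __name__ == "__main__":
    # Each extra letter gives the engine more splits to try before giving up.
    for n in (10, 14, 18):
        print(n, f"{time_failed_match(n):.4f}s")
```

The exact timings depend on the machine and Python version, but the time should climb steeply as `n` increases even though the input only gets a few characters longer.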

Specific is better, and making sure you don't backtrack over what you're committed to is better:

^RAA RAA(?: \/ OOO OOO(?: \/ ROCKY)?)?$

If the substrings aren't constant, you should specify them as completely as possible to avoid unnecessary backtracking.

(The ?: prefixes are another small optimization: they make the groups non-capturing, so the parenthesized matches aren't recorded for later extraction. If you do need the substrings, my guess is you don't want the /s with them, so capture just the parts you want.)
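As a minimal sketch of how that pattern might be used from Python: the variable names and the sample text below are assumptions for illustration. The slash is written unescaped here because Python's `re` does not need `\/`, and `re.MULTILINE` is used so that `^` and `$` anchor at each line rather than only at the ends of the whole string.

```python
import re

# The specific pattern from above, with the slash left unescaped.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
PATTERN = re.compile(r"^RAA RAA(?: / OOO OOO(?: / ROCKY)?)?$", re.MULTILINE)

# Sample input (made up for this example): three lines that should match
# and one that should not.
text = (
    "RAA RAA\n"
    "RAA RAA / OOO OOO\n"
    "RAA RAA / OOO OOO / ROCKY\n"
    "RAA RAA moves over there.\n"
)

# With no capturing groups, findall returns the full text of each match.
matches = PATTERN.findall(text)

if __name__ == "__main__":
    for line in matches:
        print(line)
```

Because every part of the pattern is a literal, a non-matching line is rejected after a single pass with essentially no backtracking.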


0

Your entire pattern is full of optional matches, which is likely causing lots of backtracking, and thus the hang. Try using a mandatory match where it makes sense, such as:

^([A-Z]+([ ]?)+([A-Z])*([ /])*)*$

A cleaner pattern, without the unnecessary capturing groups, would be:

^([A-Z]+[ ]?)+([A-Z]+[ /]*)*$

Notice that the use of + instead of * ensures that at least one character must match, rather than making the entire pattern optional and taxing the regex engine.
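Taking the "mandatory match" idea one step further, here is a sketch of a general pattern in which every repetition has to start with a literal separator, so there is only one way to divide a line into words. This assumes a format the question never spells out: words of capital letters separated by single spaces, with groups separated by " / ". The names `WORDS`, `LINE` and `is_heading` are invented for this example.

```python
import re

# Assumed format (not confirmed by the question): capital-letter words
# separated by single spaces, groups separated by " / ".
WORDS = r"[A-Z]+(?: [A-Z]+)*"
LINE = re.compile(rf"{WORDS}(?: / {WORDS})*")


def is_heading(line: str) -> bool:
    """Return True if the whole line consists of capital-letter groups."""
    # fullmatch requires the entire string to match, so no ^ or $ is needed.
    return LINE.fullmatch(line) is not None


if __name__ == "__main__":
    for sample in ("RAA RAA / OOO OOO / ROCKY", "RAA RAA moves over there."):
        print(repr(sample), is_heading(sample))
```

Since each repeated group must begin with a space or " / ", a run of letters can only be consumed by one `[A-Z]+`, which keeps failed matches cheap even on long lines.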

1 Comment

This is definitely what I need. My regex skills are woefully inadequate so this advice is much appreciated.
