0

I have a regular expression in C#, where the pattern is 8000+ words (or groups of words) each separated by word boundaries, i.e.:

"\\bword1\\b|\\bword2 word3 word4\\b|.......etc"

I am trying to match a word (or group of words) in a string to any word (or group of words) in this expression. It all works fine, except I find that on average it takes 37ms to complete the operation.

Interestingly, if I do the same thing but using String.IndexOf and some convoluted methods it does run substantially quicker (but still far too slow), which I find odd.

I am aware of other regular expression engines in particular re2/google but am really keen to use C# built in functionality where possible.

If anyone has advice it would be appreciated.

10
  • those \b's have two slashes in real life... they just got escaped when posting! Commented Jan 2, 2015 at 17:37
  • 1
    Did you use RegexOptions.Compiled? Tohugh 8000+ Word pattern is pretty overkill i guess. Commented Jan 2, 2015 at 17:38
  • @james That's why there is a "format as code" option in the editor. ;) Commented Jan 2, 2015 at 17:39
  • Did you use the compile option in the RegexOptions? for use in the constructor, var regex= new Regex(string,RegexOptions) Commented Jan 2, 2015 at 17:40
  • 3
    That is a really weird use of regexps, why not to convert your regexp to string array/dictionary and use collection methods to search Commented Jan 2, 2015 at 17:41

2 Answers 2

4

To understand why your regular expression is slow you must simply envision how regular expressions work.

In your case (8000 alternatives)

  • Step 1. Start at input character 0.
  • Step 2. Try to match alternative 0. Oops, no match.
  • Step 3. Try to match alternative 1. Oops, no match.
  • ...
  • Step 4000. Try to match alternative 4001. Yay! First character matches!
  • Step 4001. Try to match alternative 4001. Yay! Second character matches!
  • Step 4002. Try to match alternative 4001. Oops, no match.
  • Step 4003. Try to match alternative 4002. Oops, no match.
  • ...
  • Step 8963. Try to match alternative 7999. Oops, no match.
  • Step 8964. Failed all alternatives at input character 0.
  • Step 8965. Move to input character 1.
  • Step 8966. Try to match alternative 0. Oops, no match.
  • ...

The longer your input string is, the more alternatives your regex has and the more "almost, but not quite" matches occur in your input string, the slower this will get.

If you can make it faster with String.IndexOf(), do it. You will never make it faster with regex.

Explore other ways of searching strings for words. Which one would work for you strongly depends on how your input looks like.

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks, I will keep trying to find ways... obviously one way is to reduce the number of words in the regex but I am struggling to do that at the moment (long story)
I think you are at a point where regex no longer is a strategy worth following.
Believe this can also be called "Catastrophic Backtracking"
No, that's not catastrophic backtracking. This expression runs at optimal speed, it's just that the optimum isn't exactly fast in this use case.
0

I would suggest switching to HashSet, because it seems better suited to your task.

Here is a draft of what you could do:

// produce string[] containing your words
var words = myRegexp.split("\\b|\\");

var mySet = new HashSet<string>(words);
// usage
mySet.Contains("find this");

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.