2

So, I have the following Regex which I use for syntax highlighting:

static Regex cKeyWords = new Regex("(\t|\r\n|\\s|\\(|\\)|^)(auto|break|c(ase|har|onst|ontinue)|d(efaut|ouble)|e(lse|num|xtern)|f(loat|or)|goto|i(f|nt)" +
                                        "|long|re(gister|turn)|s(hort|igned|izeof|tatic|truct|witch)|typedef|u(nion|nsigned)|v(oid|olatile)|while)(?=\t|\r\n|\\s|\\(|\\)|{|}|$)", RegexOptions.Compiled);

It does what I want, but when it comes to large files with about 200,000 characters, it takes a little more than 6 seconds.

If there a way to improve performance?

EDIT: After taking a good look at all the comments/answers/tips, I now have this:

static Regex cKeyWords = new Regex(@"\b(?:
    s(?:hort|i(?:gned|zeof)|t(?:atic|ruct)|witch) | c(?:ase|har|o(?:nst|ntinue)) |
    e(?:lse|num|xtern) | i(?:f|nt) | f(?:loat|or) | d(?:efault|ouble) | un(?:ion|signed) |
    re(?:gister|turn) | vo(?:id|latile) | while | break | long | typedef | auto | goto
    )\b",
    RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

This one can handle the 200,000 characters long text in about 5.5 seconds. It's better. However, I will keep doing some tests to see if I can further reduce time.

4
  • 7
    That regex is optimal (besides using () instead of (?:) but that's a tiny concern). You should rather look into why you're trying to highlight 200kb in one run. No text editor does that. Code editors usually maintain a scope at the start of a line and highlight incrementally. See AvalonEdit's docs. Commented May 4, 2015 at 18:52
  • @LucasTrzesniewski 1) I actually tried using (?:). It actually made highlighting a little bit faster (reduced total time by 0.2s). 2) I guess you are right. Highlighting such a huge file in one run is not good. I'll take a look at the link you provided. Commented May 4, 2015 at 19:06
  • 1
    And as just another optimization for C# regex, use atomic grouping on the whole pattern (enclose in (?>...)) to avoid unnecessary backtracking. Commented May 4, 2015 at 20:03
  • 1
    You have a typo in default (efaut) Commented May 4, 2015 at 20:13

1 Answer 1

4

In my opinion (\t|\r\n|\\s|\\(|\\)|^) at the begining and (?=\t|\r\n|\\s|\\(|\\)|{|}|$) at the end are useless and can be replaced with word boundaries for the same result. (starting a pattern with an alternation is one of the worst thing you should avoid, because the regex engine must test each positions in the string with all alternatives in the worst case)

Use only capture groups when needed because they uses memory and time for nothing. In the present case, you don't need them at all.

So you can rewrite your pattern like this:

static Regex cKeyWords = new Regex(@"\b(?:
    auto | break | c(?:ase|har|onst|ontinue) | d(?:efaut|ouble) |
    e(?:lse|num|xtern) | f(?:loat|or) | goto | i(?:f|nt) | long |
    re(?:gister|turn) | s(?:hort|igned|izeof|tatic|truct|witch) | typedef |
    un(?:ion|signed) | vo(?:id|latile) | while )\b",
     RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

Note that the keyword is now in group 0 (The whole match).

Other things you can try:

  • try to factorize more for example: c(?:ase|har|on(?:st|tinue)) etc.
  • try to not factorize at all.
  • try to sort alternatives by probability (for example there is more words that begins with an "s", so you can try to put s(?:hort|igned|izeof|tatic|truct|witch) at the first place.
  • try to sort alternatives by most frequent keywords.
  • try to add (?=[a-gilr-w]) (so the first letter of all keywords) or at least (?=[a-z]) immediatly after the first word boundary (keep in mind that a word boundary can succeed at a word character position or at a non-word character position). The goal is to avoid to test the alternation when there is not an interesting letter at the word boundary position.
Sign up to request clarification or add additional context in comments.

1 Comment

(\t|\r\n|\\s|\\(|\\)|^) is indeed undesirable, since \t and \n are included in \s. Character class is also preferred over alternation if only a single character is tested, since there is no need to try other characters in the alternation if it is already matched by a character before. Anyway, the solution with \b is the better solution, as presented in the answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.