I'm writing a C# program to update the starting comment (commonly the license header) of Java source files. The following snippet does the job.

                foreach (string r in allfiles)
                {
                    // GC.Collect();
                    string thefile = System.IO.File.ReadAllText(r);
                    var pattern = @"/\*(?s:.*?)\*/[\s\S]*?package";
                    Regex regex1 = new Regex(pattern /*,RegexOptions.Compiled */) ;
                    var replaced = regex1.Replace(thefile, newheader + "package");
                    System.IO.File.WriteAllText(r, replaced);
                }

The problem is that after hundreds of source files have been processed, the process hangs at .Replace.

It's not a matter of garbage collection, as forcing it doesn't solve the issue. It also doesn't matter whether RegexOptions.Compiled is used or not.

I'm quite sure it depends on an issue in the pattern, as the hang appears on certain files that, if removed from processing, let the job continue to the end of about a thousand source files. But if I process these files alone, it works, and it also works if I use an online testing tool such as http://regexstorm.net/tester or https://www.myregextester.com/index.php

Please let me know if there is a way to optimize the search pattern for finding the first Java comment in a file.

Thank you in advance.

  • What is the input? What is newheader? The pattern is poorly written; you can rewrite it as @"(?s)/\*[^*]*\*/.*?package" and it can be improved further, but without sample input (the string it hangs on) it is difficult to help efficiently. Commented Nov 9, 2015 at 14:21
  • I just found an example of the hang with publicly available Java source code, github.com/eclipse/jgit/blob/master/org.eclipse.jgit/src/org/…, that can be used as sample input. Here the crash is deterministic. Commented Nov 9, 2015 at 14:45
  • newheader in my test was simply string.Empty. It's the new header, so it can be anything. It's a license statement, for example: /* * Copyright (C) 2015, somebody at somecompany * and other copyright owners as documented in the project's files. */ Commented Nov 9, 2015 at 14:59
  • Try the @"/\*[^*]*(?:\*(?!/)[^*]*)*\*/\s*package" regex. Or even @"/\*[^*]*(?:\*(?!/)[^*]+)*\*/\s*package". Commented Nov 9, 2015 at 15:06
  • stribizhev, the first regex in your comment is a valid solution: it solves the issue on all the files I have tested and is also very fast. I will also take a look at your second regex. If you post it as an answer, I will +1 it. Commented Nov 9, 2015 at 15:29

1 Answer

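Your regex contains two bottlenecks, both caused by lazy matching: (?s:.*?) and [\s\S]*? (the . in singleline mode and [\s\S] are synonyms). A lazy quantifier expands one character at a time and backtracks whenever the rest of the pattern fails, so the backtracking buffer can easily and quickly be overrun when the regex is run against big files.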
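One way to confirm that a given file triggers this backtracking, rather than waiting on a hung process, is to give the regex a match timeout. The sketch below keeps the original pattern from the question and relies on the Regex constructor overload that takes a TimeSpan (available since .NET 4.5), which throws RegexMatchTimeoutException when the limit is exceeded. The class name BacktrackingProbe, the method TryReplace, and the 2-second limit are hypothetical choices made for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingProbe
{
    // The original pattern from the question, unchanged except for the
    // match timeout (2 seconds is an arbitrary, assumed limit).
    private static readonly Regex LazyHeader = new Regex(
        @"/\*(?s:.*?)\*/[\s\S]*?package",
        RegexOptions.None,
        TimeSpan.FromSeconds(2));

    // Returns true if the replace finished in time; returns false (and
    // hands back the untouched source) if the regex engine timed out.
    public static bool TryReplace(string source, string newHeader, out string result)
    {
        try
        {
            result = LazyHeader.Replace(source, newHeader + "package");
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            result = source;
            return false;
        }
    }
}
```

Logging the file name whenever TryReplace returns false would give a list of the inputs that the pattern cannot handle in reasonable time.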

The common technique is to unroll the lazy construct using a negated character class and a quantified group.

You may use

@"/\*[^*]*(?:\*(?!/)[^*]*)*\*/\s*package"


The regex breakdown:

  • /\* - literal /*
  • [^*]* - 0 or more characters other than *
  • (?:\*(?!/)[^*]*)* - the unrolled variant of (?s:.*?), matching 0 or more sequences of...
    • \*(?!/) - a * symbol not followed by a /
    • [^*]* - 0 or more symbols other than *
  • \*/ - a literal sequence of */
  • \s* - 0 or more whitespace characters
  • package - literal letter sequence package
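As a minimal sketch of how the unrolled pattern could be dropped into the question's loop: the regex is built once instead of once per file, and only the first match is replaced, on the assumption that only the leading header should change. The class name HeaderRewriter, the method ReplaceHeader, and the 5-second timeout are hypothetical and not part of the answer. A MatchEvaluator is used so that any $ characters in the new header are inserted literally rather than treated as substitution tokens.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderRewriter
{
    // Unrolled pattern from the answer, built once and reused for every file.
    // The timeout is an assumed safety net, not part of the original answer.
    private static readonly Regex HeaderPattern = new Regex(
        @"/\*[^*]*(?:\*(?!/)[^*]*)*\*/\s*package",
        RegexOptions.None,
        TimeSpan.FromSeconds(5));

    // Replaces the first block comment that is followed (after optional
    // whitespace) by "package" with newHeader, keeping the package keyword.
    public static string ReplaceHeader(string source, string newHeader)
    {
        // The lambda returns the replacement verbatim, so '$' stays literal.
        // The final argument (1) limits the replacement to the first match.
        return HeaderPattern.Replace(source, m => newHeader + "package", 1);
    }
}
```

Inside the foreach loop from the question, the two regex lines would then reduce to a single call such as `var replaced = HeaderRewriter.ReplaceHeader(thefile, newheader);`.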