5

I have a regex that works great(500 nanoseconds) when a match is found, but takes a lot of time (over 3 secs) when there is no match. I suspect this could be because of backtracking. I tried some options, like converting .* to (.*)? based on some documentation, but it didn't help.

Input: a very long string - 5k chars in some cases.

Regex to match: .*substring1.*substring2.*

I am pre-compiling the pattern and re-using the matcher, what else can I try?

Here's my code snippet - I will be calling this method with millions of different input strings, but just a handful of regex patterns.

private static HashMap<String, Pattern> patternMap = new HashMap<String, Pattern>();
private static HashMap<String, Matcher> matcherMap = new HashMap<String, Matcher>();

Here's my method:

public static Boolean regex_match(String line, String regex) {
    if (regex == null || line == null) {
      return null;
    }
    if (!patternMap.containsKey(regex)) {
      patternMap.put(regex, Pattern.compile(regex));
      matcherMap.put(regex,patternMap.get(regex).matcher(""));
    }
    return matcherMap.get(regex).reset(line).find(0);
 }
5
  • 1
    What is your goal here? Do you need to use regex? Commented May 3, 2016 at 15:05
  • please show your code Commented May 3, 2016 at 15:06
  • @Pshemo - yes, I do have to use regex. Commented May 3, 2016 at 15:16
  • 1
    Is there a reason that you need .* before and after? It should be much faster if you use find() instead of match() and get rid of the .* prefix and suffix on your expression. Commented May 3, 2016 at 15:16
  • Do you have these patterns hardcoded or do you build them dynamically? I can suggest using unrolled patterns like substring1[^s]*(?:s(?!ubstring2)[^s]*)*substring2 Commented May 3, 2016 at 19:24

4 Answers 4

5

Your regex is subject to a problem known as catastrophic backtracking, as you hinted at. Essentially, the first .* will match the entire string, and then backtrack until substring1 matches. This will repeat with substring2. Because substring2 fails, the second .* will need to find another place where substring2 begins to match, and then it will fail again. Each time substring1 matches, we need to check every single place that substring2 might match.

You already are using pattern.find(), so you can omit the starting and ending .*. Then, changing the inner .* to a .*? could improve the performance by turning the greedy matcher into a lazy one.

This produces: substring1.*?substring2

Sign up to request clarification or add additional context in comments.

1 Comment

Perfect. This is way more performant than the regex that I had. Thanks for the answer.
2

You can verify that the pattern will match if you use indexOf():

int pos1 = str.indexOf("substring1");
int pos2 = str.indexOf("substring2", pos1);

if(pos1 != -1 && pos2 != -1){
  // regex
}

When the regex doesn't match, you will get catastrophic backtracking. In fact, your pattern is likely doing a lot of backtracking even when there is a match. The .* will eat up the entire string, and then needs to go backwards, reluctantly giving back characters.

If your string looks like: substring1 substring2........50000 more characters......, then you will get better performance with the lazy .*?. Please note that (.*)? is NOT the same as .*?.

The performance of the regex will vary depending on what the substrings are, and what they're matched against. If your string looks like: substring1........50000 more characters...... substring2, then you will get better performance with the .* that you have.

Comments

1

Using String.indexOf() is much faster than Regex if the case is simple enough you can use it. You could recode your problem as:

public static boolean containsStrings(String source, String string1, String string2) {
  long pos1, pos2;
  pos1 = source.indexOf(string1);
  if(pos1 > -1) {
    pos2 = source.indexOf(string2,pos1 + string1.length);
    if(pos2 > pos1 && source.indexOf(string1,pos2 + string2.length) < -1) {
      return true;
    }
  }
  return false;
}

Note that my solution does not deal with the case where string2 is contained in string1, if that is the case you'll need to add that to the logic.

3 Comments

The idea is good, but this will fail if there are two occurrences of string2, one before and one after string1. Better find string1 first and then use its index as start index for the search for string2
Unfortunately, this won't work as my function should be able to handle any regex. Thanks for the answer though.
@user100001 Too bad, in a real life case with 20mb of text I used indexOf() and contains() as much as possible and regex only for the complex cases. The performance gain from using regex sparingly on large documents was several orders of magnitude.
0

^((?!substring1).)*substring1((?!substring2).)*substring2.*?\Z

Should do it because a string that contains one substring multiple times but not both in order won't backtrack ad nauseam. You can drop the .*?\Z at the end if you don't need the matcher to end at end of input.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.