Regex pattern match performance in Java for long string

Question

I have a regex that works great(500 nanoseconds) when a match is found, but takes a lot of time (over 3 secs) when there is no match. I suspect this could be because of backtracking. I tried some options, like converting .* to (.*)? based on some documentation, but it didn't help.

Input: a very long string - 5k chars in some cases.

Regex to match: .*substring1.*substring2.*

I am pre-compiling the pattern and re-using the matcher, what else can I try?

Here's my code snippet - I will be calling this method with millions of different input strings, but just a handful of regex patterns.

private static HashMap<String, Pattern> patternMap = new HashMap<String, Pattern>();
private static HashMap<String, Matcher> matcherMap = new HashMap<String, Matcher>();

Here's my method:

public static Boolean regex_match(String line, String regex) {
    if (regex == null || line == null) {
      return null;
    }
    if (!patternMap.containsKey(regex)) {
      patternMap.put(regex, Pattern.compile(regex));
      matcherMap.put(regex,patternMap.get(regex).matcher(""));
    }
    return matcherMap.get(regex).reset(line).find(0);
 }

Is there a reason that you need .* before and after? It should be much faster if you use find() instead of match() and get rid of the .* prefix and suffix on your expression. — Chill
– Chill, Commented May 3, 2016 at 15:16
Do you have these patterns hardcoded or do you build them dynamically? I can suggest using unrolled patterns like substring1[^s]*(?:s(?!ubstring2)[^s]*)*substring2 — Wiktor Stribiżew
– Wiktor Stribiżew, Commented May 3, 2016 at 19:24

Strikeskids · Accepted Answer · 2016-05-03 15:33:24Z

5

Your regex is subject to a problem known as catastrophic backtracking, as you hinted at. Essentially, the first .* will match the entire string, and then backtrack until substring1 matches. This will repeat with substring2. Because substring2 fails, the second .* will need to find another place where substring2 begins to match, and then it will fail again. Each time substring1 matches, we need to check every single place that substring2 might match.

You already are using pattern.find(), so you can omit the starting and ending .*. Then, changing the inner .* to a .*? could improve the performance by turning the greedy matcher into a lazy one.

This produces: substring1.*?substring2

answered May 3, 2016 at 15:33

Strikeskids

4,08215 silver badges27 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

user100001 Over a year ago

Perfect. This is way more performant than the regex that I had. Thanks for the answer.

Laurel · Accepted Answer · 2016-05-03 15:40:50Z

You can verify that the pattern will match if you use indexOf():

int pos1 = str.indexOf("substring1");
int pos2 = str.indexOf("substring2", pos1);

if(pos1 != -1 && pos2 != -1){
  // regex
}

When the regex doesn't match, you will get catastrophic backtracking. In fact, your pattern is likely doing a lot of backtracking even when there is a match. The .* will eat up the entire string, and then needs to go backwards, reluctantly giving back characters.

If your string looks like: substring1 substring2........50000 more characters......, then you will get better performance with the lazy .*?. Please note that (.*)? is NOT the same as .*?.

The performance of the regex will vary depending on what the substrings are, and what they're matched against. If your string looks like: substring1........50000 more characters...... substring2, then you will get better performance with the .* that you have.

Michael Shopsin · Accepted Answer · 2016-05-03 15:33:46Z

1

Using String.indexOf() is much faster than Regex if the case is simple enough you can use it. You could recode your problem as:

public static boolean containsStrings(String source, String string1, String string2) {
  long pos1, pos2;
  pos1 = source.indexOf(string1);
  if(pos1 > -1) {
    pos2 = source.indexOf(string2,pos1 + string1.length);
    if(pos2 > pos1 && source.indexOf(string1,pos2 + string2.length) < -1) {
      return true;
    }
  }
  return false;
}

Note that my solution does not deal with the case where string2 is contained in string1, if that is the case you'll need to add that to the logic.

edited May 3, 2016 at 15:33

answered May 3, 2016 at 15:23

Michael Shopsin

2,1482 gold badges27 silver badges47 bronze badges

3 Comments

tobias_k Over a year ago

The idea is good, but this will fail if there are two occurrences of string2, one before and one after string1. Better find string1 first and then use its index as start index for the search for string2

user100001 Over a year ago

Unfortunately, this won't work as my function should be able to handle any regex. Thanks for the answer though.

Michael Shopsin Over a year ago

@user100001 Too bad, in a real life case with 20mb of text I used indexOf() and contains() as much as possible and regex only for the complex cases. The performance gain from using regex sparingly on large documents was several orders of magnitude.

Mike Samuel · Accepted Answer · 2016-05-03 18:09:37Z

0

^((?!substring1).)*substring1((?!substring2).)*substring2.*?\Z

Should do it because a string that contains one substring multiple times but not both in order won't backtrack ad nauseam. You can drop the .*?\Z at the end if you don't need the matcher to end at end of input.

answered May 3, 2016 at 18:09

Mike Samuel

121k30 gold badges230 silver badges255 bronze badges

Collectives™ on Stack Overflow

Regex pattern match performance in Java for long string

4 Answers 4

1 Comment

Comments

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related