1

I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.

I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.

My current regex:

(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$

Valid wildcard examples:

*test.com
*.test.com
*test
test.*
test*
*test*

I compile the pattern with:

private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);

I look for matches with:

// Wildcards
while (wildcardPatternMatch.find()) {
    String wildcard = wildcardPatternMatch.group();
    myProperty.add(new property(wildcard, providerId));
    System.out.println(wildcard);
}

Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll several times to remove all of the clutter before passing for regex matching?

8

2 Answers 2

2

The pattern you need is

(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$

See the regex demo

Main points:

  • The first lookahead should be after ^. If it is before, the check is done before and after each char in the string. Once it is after ^, it is only performed once at the start of a line
  • The (?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ part, although short, is actually killing performance since the (?:(?<!-)\.)? is optional pattern, and the whole pattern gets reduced to (a+)+, a known type of pattern that causes catastrphic backtracking granted there are other subpatterns to the right of it. You need to unwrap it, the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*.

The rest is OK.

Details

  • (?im) - case insensitive and multiline modifiers
  • ^ - start of a line
  • (?=\*|.+\*$) - the string should either start or end with *
  • (?:\*[.-]?)? - an optional substring matching a * and an optional . or -` char
  • [a-z0-9](?:[a-z0-9-]*[a-z0-9])? - an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and - followed with an alphanumeric char
  • (?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)* - 0 or more sequences of a dot followed with the pattern described above
  • (?:[.-]?\*)? - an optional substring matching an optional . or -char and then a*`
  • $ - end of a line.
Sign up to request clarification or add additional context in comments.

1 Comment

Excellent. Thanks so much for taking the time to give a detailed explanation.
1

I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes

Your regex contains a repeated pattern:

(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ 

Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times

Edit: Another good reference

1 Comment

No problem, I also noticed when I looked at it that putting *, *-, -- or - in the inputs caused it to take a long time. I'm not sure if your data would contain those strings by any chance, but it'd be worth looking changing the regex to accommodate that into if it does.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.