2

I don't know why but when I try to run this program, it's looks like the program will run forever.

package fjr.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {

    public static void main(String[] args){

        String regex = "dssdfsdfdsf wdasdads dadlkn mdsds .";

        Pattern p  = Pattern.compile("^([a-zA-Z]+ *)+$"); 

        Matcher match =  p.matcher(regex); 

        if(match.matches()){
            System.out.println("Yess"); 
        }else{
            System.out.println("No.. "); 
        }

        System.out.println("FINISH..."); 
    }

}

What I need to do was to match the pattern that contain a bunch of word that only separated by spaces

5
  • 1
    can you give an example. or the above result you want Commented Aug 11, 2014 at 4:27
  • I just need to match the sentences (a bunch of word) that only contain alphabetical character, that separated by one or more spaces... you can try above program.. Commented Aug 11, 2014 at 4:30
  • But why when I add a period (.) at the end of sentence, the result not like what I expecting (to print "NO" word) Commented Aug 11, 2014 at 4:32
  • between space or either side? Commented Aug 11, 2014 at 4:32
  • I mean a space separate some word.. like person name that contain two or more word (not contain a number) Commented Aug 11, 2014 at 4:34

4 Answers 4

8

Your program has likely encountered what's called catastrophic backtracking.

If you have a bit of time, let's look at how your regex works...

Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.

On the left hand side, we have our pattern:

/^([a-zA-Z]+ *)+$/

And here's the String to match:

dssdfsdfdsf wdasdads dadlkn mdsds .

From the regex101 debugger, your regex took 78540 steps to fail. This is because you used quantifiers that are greedy and not possessive (backtracking).

x

... Long story short, because the input string fails to match, every quantifier within your regex causes indefinite backtracking - Every character is released from + and then * and then both and then a group is released from ()+ to backtrack more.

Here's a few solutions you should follow:

Avoid abundant quantifiers!

If you revise your expression, you'll see that the pattern is logically same as:

/^[a-zA-Z]+( +[a-zA-Z]+)*$/

This uses a step of logical induction to reduce the regex upstairs to match far quicker, now at 97 steps!

y

Use possessive quantifiers while you can!

As I mentioned, /^([a-zA-Z]+ *)+$/ is evil because it backtracks in a terrible manner. We're in Java, what can we do?

This solution works only because [a-zA-Z] and matches distinct items. We can use a possessive group!

/^([a-zA-Z]++ *+)++$/
            ^  ^  ^

These simple "+" denotes "We're not backtracking if we fail the match from here". This is an extremely effective solution, and cuts off any need for backtracking. Whenever you have two distinct groups with a quantifier in between, use them. And if you need some proof on the effectiveness, here's our scorecard:

z

Read also:

Online Demos:

Sign up to request clarification or add additional context in comments.

Comments

2

This does terminate, but it does take 10 seconds or so. Some observations:

  • Removing the fullstop from the end of the test string makes it fast.
  • Changing the * in the regex to a + (which i believe is actually what you want) makes it fast. I think having the option of 0 characters in that spot expands the state space a lot. I would use:

    ^(\w+ +)*\w+$"

Which means a bunch of (word space), followed by a word. Ran against your example and it is fast

Comments

1

You can use this regex to match all word with spaces or none.

sample:

 Pattern p  = Pattern.compile("([a-zA-Z ]+)"); 

2 Comments

So what happens with my program above, can you give some logical explanation?
@fajar66 you are matching it + and * at the same timw which takes time to circulate the whole string
0

Interesting phenomena. This is related on Greedy quantifiers's behaviour & performance.
Based on your pattern ^([a-zA-Z]+ *)+$ , and your inpput "dssdfsdfdsf wdasdads dadlkn mdsds ." , the patther doesn't match your input, however, the java regex will backtrack all the possibilities of ^([a-zA-Z]+ *)+, (see below examples) and then obtain the not-match results.

(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsds )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsd)(s )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mds)(ds )
...
(d)(s)(s)(d)(f)(s)(d)(f)(d)(s)(f )(w)(d)(a)(s)(d)(a)(d)(s )(d)(a)(d)(l)(k)(n )(m)(d)(s)(d)(s )
...
(dssdfsdfdsf )
...
(d)

The backtrack could be more than 200,000,000 times.
I'm a little bit curious why java-regex can't do some performance improvement, since after first try, it found the epxected char is '.', not '$', then any further backtrack won't be successful. it's useless to do the backtrack.

Therefore, when we define the Loop pattern, we need to pay more attenttion on the internal Loop pattern (e.g.: [a-zA-Z]+ * in your example), not making it matching so many case. Otherwise, the backtrack numbers for the whole Loop will be huge.
Another similar exmaple (more bad then your case):
Pattern: "(.+)*A"
Input: "abcdefghijk lmnopqrs tuvwxy zzzzz A" -- So quick
Input: "abcdefghijAk lmnopqrs tuvwxy zzzzz " -- Not bad, just wait for a while
Input: "aAbcdefghijk lmnopqrs tuvwxy zzzzz " -- it seems infinit. (acutally, not, but I have no patience to wait its finishing.)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.