5

I'm trying to get ALL the substrings in the input string that match the given pattern.

For example,

Given string: aaxxbbaxb
Pattern: a[a-z]{0,3}b
(What I actually want to express is: all the substrings that start with a and end with b, with up to 3 letters in between.)

Exact results that I want (with their indexes):

aaxxb: index 0~4
axxb: index 1~4
axxbb: index 1~5
axb: index 6~8

But when I run it through the Pattern and Matcher classes using Pattern.compile() and Matcher.find(), it only gives me:

aaxxb : index 0~4
axb : index 6~8

This is the piece of code I used.

Pattern pattern = Pattern.compile("a[a-z]{0,3}b", Pattern.CASE_INSENSITIVE);
Matcher match = pattern.matcher("aaxxbbaxb");
while (match.find()) {
    System.out.println(match.group());
}

How can I retrieve every single piece of string that matches the pattern?

Of course, it doesn't have to use Pattern and Matcher classes, as long as it's efficient :)

  • Why do you have the dot in a[a-z].{0,2}b? If you want the pattern a_b, where _ can be 0-2 alphabetical chars, then the dot is wrong there, isn't it? Commented Sep 6, 2011 at 10:39
  • How is aaxxbb a string "that starts with a and ends with b" and can have up to two letters between? Commented Sep 6, 2011 at 10:40
  • Thanks Tom and jmg for pointing that out!!! I edited the original post. Commented Sep 6, 2011 at 13:42

3 Answers

3

(See: All overlapping substrings matching a Java regex)

Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.

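  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public static void allMatches(String text, String regex)
  {
    // Try every candidate range [i, j) of the text.
    for (int i = 0; i < text.length(); ++i) {
      for (int j = i + 1; j <= text.length(); ++j) {
        // The lookbehind pins the match start to index i (exactly i characters before it);
        // the lookahead pins the match end to index j (exactly text.length() - j characters after it).
        // Note: '.' does not match line terminators by default, so this assumes single-line text.
        String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
        Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);

        if (m.find())
        {
          System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
        }
      }
    }
  }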
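
A possible variation on the same idea, offered only as a sketch: instead of building and compiling a padded pattern for every pair of positions, compile the regex once and restrict the matcher to each candidate range with `Matcher.region`. Transparent, non-anchoring bounds are switched on so that lookarounds and boundaries in the original regex can still see the surrounding text. The class name `AllMatchesByRegion` and the choice to return a list of `{start, end}` pairs are illustrative assumptions, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesByRegion {
    // Returns every range {start, endExclusive} of text that the regex matches exactly.
    public static List<int[]> allMatches(String text, String regex) {
        List<int[]> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        // Let lookarounds and \b see text outside the region,
        // and keep ^ and $ tied to the real input boundaries.
        m.useTransparentBounds(true);
        m.useAnchoringBounds(false);
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                m.region(i, j);      // limit matching to [i, j)
                if (m.matches()) {   // the whole region must match
                    result.add(new int[] {i, j});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaxxbbaxb";
        for (int[] r : allMatches(text, "a[a-z]{0,3}b")) {
            System.out.println("Match found: \"" + text.substring(r[0], r[1])
                    + "\" at position [" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

This still examines every pair of positions, so it remains quadratic in the length of the text; it only avoids recompiling the pattern for each pair.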

1

You are in effect searching for the strings ab, a_b, and a__b in an input string, where _ denotes a non-whitespace character whose value you do not care about.

That's three search targets. The most efficient way I can think of to do this would be to use a search algorithm like the Knuth-Morris-Pratt algorithm, with a few modifications. In effect your pseudocode would be something like:

for i in 0 to sourcestring.length - searchstring.length
    // x is the index of the last character of the search string (its length - 1)
    if sourcestring[i] is 'a' and sourcestring[i+x] is 'b'
        save i to output list
    // otherwise just move on to i + 1;
    // skipping ahead by searchstring.length could jump over a match
Obviously, if you have a position match you must then check the inner characters of the substring to make sure they are alphabetical.

Run the algorithm three times, once for each search term. It should be considerably faster than doing the search with regex pattern matching.
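
As a rough illustration of the scan described above (a plain loop, not actual Knuth-Morris-Pratt), the separate passes, one per target length, can be folded into a single loop that tries every allowed gap at each `a`. The names `DirectScan`, `findAll`, and the `maxInner` parameter are hypothetical, and the letter check assumes lowercase ASCII `a`-`z`, like the pattern in the question. Passing `maxInner = 3` follows the `{0,3}` in the question's pattern.

```java
import java.util.ArrayList;
import java.util.List;

class DirectScan {
    // Finds every substring that starts with 'a', ends with 'b', and has
    // between 0 and maxInner lowercase letters in between.
    // Each result is a {start, endInclusive} pair.
    static List<int[]> findAll(String s, int maxInner) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != 'a') {
                continue; // a match can only start at an 'a'
            }
            // Try each gap length in turn, walking one character further each time.
            for (int gap = 0; gap <= maxInner && i + gap + 1 < s.length(); gap++) {
                int end = i + gap + 1;
                char c = s.charAt(end);
                if (c == 'b') {
                    result.add(new int[] {i, end});
                }
                if (c < 'a' || c > 'z') {
                    break; // c would be an inner character of any longer match
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "aaxxbbaxb";
        for (int[] r : findAll(s, 3)) {
            System.out.println(s.substring(r[0], r[1] + 1) + ": index " + r[0] + "~" + r[1]);
        }
    }
}
```

For the question's input this reports the four ranges the asker lists (0~4, 1~4, 1~5, 6~8), doing at most `maxInner + 1` character checks per starting position.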

Edit: sorry, I didn't read the question properly. If you have to use regex, then the above will not work for you.

1 Comment

Hmm... searching for three individual targets. Thanks, I'll check it out!

0

One thing you could do is:

  • Create all possible substrings that are 4 characters or longer (good luck with that if your String is large)
  • Create a new Matcher for each of these substrings
  • Do a matches() instead of a find()
  • Calculate the absolute offset from the substring's relative offset and the matcher info
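
The steps above might look roughly like the following sketch. The class and method names are made up for illustration, and the minimum substring length is left as a parameter (`minLen`) rather than fixed at 4; the usage below passes 2 because `ab` is the shortest string the question's pattern can match. Since `matches()` has to consume the whole substring, the absolute offsets are simply the substring's own start and end in the original string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SubstringMatches {
    // Tests every substring of length >= minLen against the whole pattern.
    // Each result is formatted as "substring@start-endInclusive".
    static List<String> findAll(String text, String regex, int minLen) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (int start = 0; start + minLen <= text.length(); start++) {
            for (int end = start + minLen; end <= text.length(); end++) {
                String candidate = text.substring(start, end);
                // matches() succeeds only if the entire candidate fits the pattern.
                if (pattern.matcher(candidate).matches()) {
                    result.add(candidate + "@" + start + "-" + (end - 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String hit : findAll("aaxxbbaxb", "a[a-z]{0,3}b", 2)) {
            System.out.println(hit);
        }
    }
}
```

One caveat with this approach: each substring is matched in isolation, so lookarounds or word boundaries in the regex cannot see the characters around it.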
