21

I have tests where I validate the output with a regex. When it fails it reports that output X did not match regex Y.

I would like to add some indication of where in the string the match failed. E.g. what is the farthest the matcher got in the string before backtracking. Matcher.hitEnd() is one case of what I'm looking for, but I want something more general.

Is this possible to do?

2
  • 1
    This is probably your best bet: stackoverflow.com/questions/2348694/how-do-you-debug-a-regex Commented May 4, 2011 at 19:55
  • @Reverend Gonzo: Thanks, Perl's "use re 'debug'" is close to what I'm looking for. Something similar that is callable from Java would be great. Commented May 7, 2011 at 0:24

4 Answers 4

10

If a match fails, then Match.hitEnd() tells you whether a longer string could have matched. In addition, you can specify a region in the input sequence that will be searched to find a match. So if you have a string that cannot be matched, you can test its prefixes to see where the match fails:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastMatch {
    private static int indexOfLastMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        for (int i = input.length(); i > 0; --i) {
            matcher.region(0, i);
            if (matcher.matches() || matcher.hitEnd()) {
                return i;
            }
        }

        return 0;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
        String[] samples = {
                // partial matches, need more text
                "",
                "ABC",
                "ABC123",

                // full matches
                "A1x",
                "ABC123xyz",

                // mismatches
                "-BC123xyz",
                "A-C123xyz",
                "AB-123xyz",
                "ABC-23xyz",
                "ABC1-3xyz",
                "ABC12-xyz",
                "ABC123-yz",
                "ABC123x-z",
                "ABC123xy-"
        };

        for (String sample : samples) {
            System.out.printf("`%s`: last match at %d of %d\n",
                    sample, indexOfLastMatch(pattern, sample), sample.length());
        }
    }
}

The output of this class is:

``: last match at 0 of 0
`ABC`: last match at 3 of 3
`ABC123`: last match at 6 of 6
`A1x`: last match at 3 of 3
`ABC123xyz`: last match at 9 of 9
`-BC123xyz`: last match at 0 of 9
`A-C123xyz`: last match at 1 of 9
`AB-123xyz`: last match at 2 of 9
`ABC-23xyz`: last match at 3 of 9
`ABC1-3xyz`: last match at 4 of 9
`ABC12-xyz`: last match at 5 of 9
`ABC123-yz`: last match at 6 of 9
`ABC123x-z`: last match at 7 of 9
`ABC123xy-`: last match at 8 of 9

As suggested by Don Hatch, using binary search reduces the number of required matches from O(N) to O(log N), where N is the length of the input string -- that is the overall complexity is O(N log N) instead of O(N * N) for the linear search in indexOfLastMatch.

private static int binarySearchLastMatch(Pattern pattern, String input) {
    Matcher matcher = pattern.matcher(input);
    int lastMatch = 0;

    int left = 0;
    int right = input.length();
    while (left <= right) {
        int mid = left + (right - left) / 2;
        matcher.region(0, mid);
        if (matcher.matches() || matcher.hitEnd()) {
            lastMatch = mid;
            left = mid + 1;
        } else {
            right = mid - 1;
        }
    }

    return lastMatch;
}
Sign up to request clarification or add additional context in comments.

5 Comments

This is good, though I find the second case confusing. The entire string matches, so why report 4? I suggest: region.matches(); if (region.hitEnd()) .... Then it returns 6 for that case.
Good catch. I only tested for partial matches and didn't take into account full matches of the complete string or any of its prefixes. This is now fixed.
If I'm not mistaken this is O(n^2)-ish, and could be made O(n log n)-ish by binary search.
Maybe a minor point, but I find it non-obvious that matcher.region() actually returns matcher (which got modified). It works, but will fail if, say, someone modifies the code to use matcher after the loop, wrongly thinking it's the original matcher value. This could perhaps be made less error-prone by refraining from using the return value of matcher.region(), and just use matcher directly throughout.
@DonHatch Fair points. Since matcher.region() does not return a new instance, but changes the matcher itself, there is really no point to introduce a new variable for that. So I have updated my code snippet. I also added an alternative implementation that uses binary search to find the index of the last match (or rather first mismatch). However, nowadays any AI could provide you that. :-)
4

You can take the string, and iterate over it, removing one more char from its end at every iteration, and then check for hitEnd():

int farthestPoint(Pattern pattern, String input) {
    for (int i = input.length() - 1; i > 0; i--) {
        Matcher matcher = pattern.matcher(input.substring(0, i));
        if (!matcher.matches() && matcher.hitEnd()) {
            return i;
        }
    }
    return 0;
}

2 Comments

intput.length()
If I'm not mistaken this is O(n^2)-ish, and could be made O(n log n)-ish by binary search.
1

You could use a pair of replaceAll() calls to indicate the positive and negative matches of the input string. Let's say, for example, you want to validate a hex string; the following will indicate the valid and invalid characters of the input string.

String regex = "[0-9A-F]"
String input = "J900ZZAAFZ99X"
Pattern p = Pattern.compile(regex)
Matcher m = p.matcher(input)
String mask = m.replaceAll('+').replaceAll('[^+]', '-')
System.out.println(input)
System.out.println(mask)

This would print the following, with a + under valid characters and a - under invalid characters.

J900ZZAAFZ99X
-+++--+++-++-

Comments

0

If you want to do it outside of the code, I use rubular to test the regex expressions before sticking them in the code.

1 Comment

Unless I'm missing something, this tells me if text doesn't match a regex, but it doesn't tell me where it fails to match. That's what I'm looking for.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.