12

I have a Java regex pattern and a sentence I'd like to completely match, but for some sentences it erroneously fails. Why is this? (For simplicity, I won't use my complex regex, just ".*".)

System.out.println(Pattern.matches(".*", "asdf"));
System.out.println(Pattern.matches(".*", "[11:04:34] <@Aimbotter> 1 more thing"));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));
System.out.println(Pattern.matches(".*", "[11:04:35] <@Aimbotter> Dialogue: 0,0:00:00.00,0:00:00.00,Default,{Orginal LV,0000,0000,0000,,[???]????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????} "));

Output:

true
true
true
false

Note that the fourth sentence contains 10 Unicode control characters \u0085 in between the question marks, which aren't shown by normal fonts. The third and fourth sentences actually contain the same number of characters!

3
  • This is especially odd because Java is a Unicode regex engine... Commented May 12, 2011 at 12:52
  • It would be worse if Java would not know about Unicode line terminators (fileformat.info/info/unicode/char/85/index.htm) Commented May 12, 2011 at 13:03
  • ...@tchrist will soon be around and tell us all about how broken the java regex engine is. Commented May 12, 2011 at 13:05

4 Answers

13

use

Pattern.compile(".*",Pattern.DOTALL)

if you want . to match line terminators such as \u0085. By default it matches any character except line terminators.

From JavaDoc:

"In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)"
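As a minimal sketch of the two options the JavaDoc describes, the following compares the default dot with Pattern.DOTALL and with the embedded (?s) flag. The class name, method names, and the sample input are made up for illustration; only Pattern.matches, Pattern.compile, and Pattern.DOTALL come from java.util.regex.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical sample input: a NEL character (U+0085) embedded in ordinary text.
    static final String INPUT = "before\u0085after";

    // Default mode: the dot stops at line terminators, so the full match fails.
    static boolean matchesDefault(String s) {
        return Pattern.matches(".*", s);
    }

    // DOTALL passed as a compile flag.
    static boolean matchesWithFlag(String s) {
        return Pattern.compile(".*", Pattern.DOTALL).matcher(s).matches();
    }

    // DOTALL enabled inside the pattern string via (?s).
    static boolean matchesWithEmbeddedFlag(String s) {
        return Pattern.matches("(?s).*", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault(INPUT));          // false
        System.out.println(matchesWithFlag(INPUT));         // true
        System.out.println(matchesWithEmbeddedFlag(INPUT)); // true
    }
}
```

The embedded form is convenient when the pattern text is shared as a string constant and the call sites that compile it cannot easily be changed.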

Code in Pattern (there is your \u0085):

/**
 * Implements the Unicode category ALL and the dot metacharacter when
 * in dotall mode.
 */
static final class All extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return true;
    }
}

/**
 * Node class for the dot metacharacter when dotall is not enabled.
 */
static final class Dot extends CharProperty {
    boolean isSatisfiedBy(int ch) {
        return (ch != '\n' && ch != '\r'
                && (ch|1) != '\u2029'
                && ch != '\u0085');
    }
}
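To see the effect of that Dot check from the outside, here is a small sketch that asks the regex engine whether a single "." matches a handful of code points. The class name, the helper, and the choice of sample code points are assumptions for illustration; the expectation is that only the line terminators are rejected, while other control characters such as tab or NUL are accepted.

```java
import java.util.regex.Pattern;

public class DotTerminatorDemo {
    // Returns true if a lone "." (default flags) matches the given code point.
    static boolean dotMatches(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(".", s);
    }

    public static void main(String[] args) {
        // NUL, BEL, TAB, then the line terminators LF, CR, NEL, LS, PS.
        int[] samples = {0x00, 0x07, 0x09, 0x0A, 0x0D, 0x85, 0x2028, 0x2029};
        for (int cp : samples) {
            System.out.printf("U+%04X matched by '.': %b%n", cp, dotMatches(cp));
        }
    }
}
```

Note that (ch|1) != '\u2029' in the source above covers both U+2028 and U+2029, since setting the lowest bit maps either one to U+2029.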

1 Comment

Thanks, (?s) worked. I didn't try Pattern.DOTALL because I have a ton of different compiled patterns, and I only had to use (?s) once (in a string constant that I include in most patterns).
4

The answer is in the question: the 10 Unicode control characters \u0085.

\u0085 (NEL, "next line") is a line terminator, and by default . does not match line terminators, just like \n.


2

Unicode \u0085 is a line terminator (NEL, "next line"), so you have to either add (?s) (dot matches all) to the beginning of your regex or pass the Pattern.DOTALL flag when compiling the regex.

Pattern.matches("(?s).*", "blahDeBlah\u0085Blah")

1 Comment

Not (?m)- Multiline mode means that ^ and $ match at start/end of lines. You want (?s) for singleline mode. Yes, it is confusing (the idea is to "treat the entire input as if it were a single line").
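The difference between the two flags can be sketched as follows. The class and method names and the sample strings are hypothetical; the point is only that (?m) changes where ^ and $ may match, while (?s) changes what . may match.

```java
import java.util.regex.Pattern;

public class MultilineVsDotallDemo {
    // Full-match check, as Pattern.matches does in the question.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Substring search, needed to observe the effect of ^ and $.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String nel = "a\u0085b";
        System.out.println(fullMatch("(?m).*", nel)); // false: the dot still stops at NEL
        System.out.println(fullMatch("(?s).*", nel)); // true: the dot now matches NEL

        String lines = "a\nb";
        System.out.println(found("^b$", lines));      // false: ^ only matches at the start of input
        System.out.println(found("(?m)^b$", lines));  // true: ^ and $ match at line boundaries
    }
}
```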
1

The problem, I believe, is that \u0085 represents a newline. If you want . to match across line terminators, you need Pattern.DOTALL (Pattern.MULTILINE only changes where ^ and $ match). It's not the fact that it is Unicode: '\n' would fail too.

To use it: Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches()
