127

The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two spaces.

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");

The aim is to replace every pair of consecutive whitespace characters with a single space. However, this does not actually work.

Am I having a grave misunderstanding of regexes or the term "whitespace"?

4
  • 2
    String has a replaceAll function that will save you a few lines of code. download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html Commented Jan 19, 2011 at 2:05
  • 1
    It isn’t your misunderstanding, but Java’s. Try splitting a string like "abc \xA0 def \x85 xyz" to see what I mean: there are only three fields there. Commented Apr 11, 2011 at 15:15
  • 5
    Did you try "\\s+"? With this you replace each run of one or more whitespace characters with a single space. Commented May 5, 2013 at 12:33
  • 1
    I've been wondering for over an hour why my \\s split is not splitting over whitespace. Thanks a million! Commented May 18, 2014 at 0:28

11 Answers

219

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
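As a rough sketch of how that class might be used: the snippet below writes the same 26 code points as one compact character class (the ranges are my condensation, not the layout used in this answer) and collapses each run into a single plain space. The names `UnicodeWhitespaceDemo` and `collapse` are made up for illustration.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // Same 26 code points as the long list, condensed into ranges
    // (an assumption made for brevity; the long form behaves the same).
    static final String WHITESPACE_CHARCLASS =
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

    static final Pattern WHITESPACE_RUN = Pattern.compile(WHITESPACE_CHARCLASS + "+");

    // Replaces every run of one or more whitespace characters with one plain space.
    static String collapse(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "abc \u00A0 def\u2003\u3000xyz";
        System.out.println(collapse(text)); // abc def xyz
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.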


Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mind-numbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.


10 Comments

this is really old. is it correct that this was fixed in java7 with the UNICODE_CHARACTER_CLASS flag? (or using (?U))
A shorter way to rewrite \s is [\s\u0085\p{Z}].
@tchrist If this is fixed in java 7+, could you update the answer with the now-correct way to do this?
With Java 7+ you can do: "(?U)\s" to run the regex with Unicode Technical Standard conformance. Or you can make the UNICODE_CHARACTER_CLASS flag true when creating the pattern. Here's the doc: docs.oracle.com/javase/7/docs/api/java/util/regex/…
The above code is missing \\u200B (ZERO WIDTH SPACE)
47

Yeah, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);
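To spell that out, here is a minimal, self-contained sketch of the corrected code. The class name `CollapsePairs` and the sample input are placeholders; `modLine` is the variable name from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CollapsePairs {
    static final Pattern WHITESPACE_PAIR = Pattern.compile("\\s\\s");

    // Strings are immutable, so replaceAll returns a new string
    // and leaves the input untouched.
    static String collapsePairs(String modLine) {
        Matcher matcher = WHITESPACE_PAIR.matcher(modLine);
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        String modLine = "a  b    c";
        String result = collapsePairs(modLine);
        System.out.println(result); // a b  c
    }
}
```

No `while (matcher.find())` loop is needed, since `replaceAll` already handles every match in one pass. Note that a run of four spaces becomes two, so a pattern such as `\\s{2,}` or `\\s+` is the better fit if whole runs should collapse to one space.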

4 Comments

Gah. I feel like the biggest idiot on earth. Neither I nor two other people seemed to notice that. I guess the stupidest little errors throw us off sometimes, eh?
So true! I guess that happens with the best of them
What should I do if I need to check whether the text contains whitespace?
Per my answer below use \p{Zs} instead of \s if you want to match unicode whitespace.
18

For Java (not PHP, not JavaScript, not any other language):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
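For illustration, a small sketch of what this pattern does and does not match. As far as I know, `\p{javaSpaceChar}` delegates to `Character.isSpaceChar`, which covers the Unicode separator categories (Zs, Zl, Zp) but not control characters such as tab or line feed. The class name is hypothetical.

```java
class JavaSpaceCharDemo {
    // Collapses runs of two or more "space characters" into one plain space.
    static String collapse(String txt) {
        return txt.replaceAll("\\p{javaSpaceChar}{2,}", " ");
    }

    public static void main(String[] args) {
        // No-break space (U+00A0) counts as a space character.
        System.out.println(collapse("a\u00A0\u00A0b")); // a b
        // Tabs are control characters, so this input is left alone.
        System.out.println(collapse("a\t\tb").equals("a\t\tb")); // true
    }
}
```

If tabs and line breaks should be collapsed as well, the pattern would need to be combined with `\s`, for example `[\\s\\p{javaSpaceChar}]{2,}`.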

4 Comments

Strings are immutable, thus you have to assign the result to something, such as 'txt = txt.replaceAll()' I did not vote-down your answer, but that might be why someone else did so.
I know replaceAll returns a string. The important thing for Java programmers is \\p{javaSpaceChar}.
The original question made the mistake of not assigning the new string to a variable. Pointing out that mistake is thus the most important point of the answer.
This totally solved my problem in Groovy! Finally! Been trying every regex I could find that would match all white space including NO-BREAK SPACE (U+00A0, decimal 160)!!!
12

Java has evolved since this issue was first brought up. You can match all manner of unicode space characters by using the \p{Zs} group.

Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:

String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");

Also worth knowing: if you've used the trim() string function, you should take a look at the (relatively new, Java 11) strip(), stripLeading(), and stripTrailing() functions on strings. They can help you trim off all sorts of squirrelly white space characters. For more information on which white space characters are included, see Java's Character.isWhitespace() function.
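A quick sketch of the difference, assuming Java 11 or later (where `strip()` was added). The class and method names are made up for illustration.

```java
class StripDemo {
    // Reports the length left after trim() and after strip().
    static String compare(String s) {
        return "trim=" + s.trim().length() + " strip=" + s.strip().length();
    }

    public static void main(String[] args) {
        // EM SPACE (U+2003) on both sides: trim() only removes characters at or
        // below U+0020, while strip() relies on Character.isWhitespace.
        System.out.println(compare("\u2003hello\u2003")); // trim=7 strip=5

        // No-break space (U+00A0) is not whitespace per Character.isWhitespace,
        // so neither method removes it.
        System.out.println(compare("\u00A0hi")); // trim=3 strip=3
    }
}
```

So `strip()` is broader than `trim()`, but it still leaves no-break spaces in place; those need a regex such as `\\p{Zs}`.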

3 Comments

As a heads up, this does not match newlines, but this does.
@CaptainMan the answer you reference leaves out a small note from the JavaDoc: "Specifying this flag may impose a performance penalty." To avoid that performance hit I would suggest \p{Zl} for line separators and \p{Zp} for paragraph separators.
Expanded: txt.replaceAll("(\\p{Zs}|\\p{Zl}|\\p{Zp})+", " "); to replace all sorts of separators with a single space character.
9

To match any whitespace character, you can use

Pattern whitespace = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

The Pattern.UNICODE_CHARACTER_CLASS option "enables the Unicode version of Predefined character classes and POSIX character classes" that are then "in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties".

The same behavior can also be enabled with the (?U) embedded flag expression. For example, if you want to replace/remove all Unicode whitespaces in Java with regex, you can use

String removed   = text.replaceAll("(?U)\\s+", "");     // removes all whitespace
String dashed    = text.replaceAll("(?U)\\s", "-");     // replaces each single whitespace character with -
String collapsed = text.replaceAll("(?U)\\s+", "-");    // replaces each run of one or more consecutive whitespace characters with a single -
String leading   = text.replaceAll("(?U)\\G\\s", "-");  // replaces each whitespace character in the run at the start of the string with -

See the Java demo online:

String text = "\u00A0 \u00A0\tStart reading\u00A0here..."; // \u00A0 - non-breaking space
System.out.println("Text: '" + text + "'"); // => Text: '       Start reading here...'
System.out.println(text.replaceAll("(?U)\\s+", "")); // => Startreadinghere...
System.out.println(text.replaceAll("(?U)\\s", "-")); // => ----Start-reading-here...
System.out.println(text.replaceAll("(?U)\\s+", "-")); // => -Start-reading-here...
System.out.println(text.replaceAll("(?U)\\G\\s", "-")); // => ----Start reading here... 


6

Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = "";
if (matcher.find()) {
    result = matcher.replaceAll(" ");
}

System.out.println(result);


6

When I sent a question to the RegexBuddy (regex development application) forum, I got a more precise reply to my Java \s question:

"Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."
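A small sketch for checking the suggested class `[\s\p{Z}]`. One caveat I am fairly sure of: NEXT LINE (U+0085) is a control character, so without any flags neither `\s` nor `\p{Z}` matches it and it has to be added explicitly. The class and method names here are hypothetical.

```java
class NextLineDemo {
    // Replaces every character matched by the given class with a dash.
    static String mark(String input, String whitespaceClass) {
        return input.replaceAll(whitespaceClass, "-");
    }

    public static void main(String[] args) {
        // NEL, then no-break space, then tab, between the letters.
        String text = "a\u0085b\u00A0c\td";

        // NEL survives: it is in neither the default shorthand nor category Z.
        System.out.println(mark(text, "[\\s\\p{Z}]").equals("a\u0085b-c-d")); // true

        // Adding U+0085 to the class covers it as well.
        System.out.println(mark(text, "[\\s\\u0085\\p{Z}]")); // a-b-c-d
    }
}
```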

1 Comment

[\s\p{Z}] omits the Unicode "next line" character U+0085. Use [\s\u0085\p{Z}].
3
Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(modLine);

// Keep replacing until no pair of whitespace characters is left
while (matcher.find())
{
 // Update your original search text with the result of the replace
 modLine = matcher.replaceAll(" ");
 // Reset the matcher to look at this "new" text, then search again
 matcher = whitespace.matcher(modLine);
}

2 Comments

Mike, while I appreciate you taking the time to answer, this question has been solved several months ago. There is no need to answer questions as old as this.
If someone can show a different, better solution, answering old questions is perfectly legit.
3

For your purpose you can use this snippet:

import org.apache.commons.lang3.StringUtils;

String normalized = StringUtils.normalizeSpace(string); // returns a new string

This collapses each run of whitespace into a single space and also strips leading and trailing whitespace. Plain replaceAll works too if you do not want the library:

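String sampleString = "Hello    world!";
// replaceAll returns a new string, so the result has to be assigned
String pairsReplaced = sampleString.replaceAll("\\s{2}", " ");  // replaces each pair of consecutive whitespace characters
String runsReplaced  = sampleString.replaceAll("\\s{2,}", " "); // replaces each run of two or more whitespace characters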


0

You can use something simpler, if only plain spaces matter:

String out = in.replaceAll(" {2}", " ");


-3

Using whitespace in regular expressions is a pain, but I believe it works. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use a regex (uncomment the println() to see how the matcher breaks up the String), here is some sample code:

import java.util.regex.*;

public class Two21WS {
    private String  str = "";
    private Pattern pattern = Pattern.compile ("\\s{2,}");  // multiple spaces

    public Two21WS (String s) {
            StringBuffer sb = new StringBuffer();
            Matcher matcher = pattern.matcher (s);
            int startNext = 0;
            while (matcher.find (startNext)) {
                    if (startNext == 0)
                            sb.append (s.substring (0, matcher.start()));
                    else
                            sb.append (s.substring (startNext, matcher.start()));
                    sb.append (" ");
                    startNext = matcher.end();
                    //System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
                    //                      ", sb: \"" + sb.toString() + "\"");
            }
            sb.append (s.substring (startNext));
            str = sb.toString();
    }

    public String toString () {
            return str;
    }

    public static void main (String[] args) {
            String tester = " a    b      cdef     gh  ij   kl";
            System.out.println ("Initial: \"" + tester + "\"");
            System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
}}

It produces the following (compile with javac and run at the command prompt):

% java Two21WS
Initial: " a    b      cdef     gh  ij   kl"
Two21WS: " a b cdef gh ij kl"

1 Comment

WTF!? Why would you want to do all that when you can just call replaceAll() instead?
