Whitespace Matching Regex - Java

Question

The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two spaces.

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");

The aim is to replace every instance of two consecutive whitespace characters with a single space. However, this does not actually work.

Do I have a grave misunderstanding of regexes or of the term "whitespace"?

String has a replaceAll function that will save you a few lines of code. download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
– Zach L, Commented Jan 19, 2011 at 2:05
It isn’t your misunderstanding, but Java’s. Try splitting a string like "abc \xA0 def \x85 xyz" to see what I mean: there are only three fields there.
– tchrist, Commented Apr 11, 2011 at 15:15
Did you try "\\s+"? With this you replace one or more whitespace characters with a single space.
– hrzafer, Commented May 5, 2013 at 12:33
I've been wondering for over an hour why my \\s split is not splitting over whitespace. Thanks a million!
– Marcin, Commented May 18, 2014 at 0:28

Community · Accepted Answer · 2019-07-10 15:45:16Z

219

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
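A minimal sketch of that usage, assuming a hypothetical helper class named UnicodeWhitespaceCollapse (not part of the original answer). The character class holds the same 26 code points as the list above, compressed into ranges to keep the example short:

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceCollapse {
    // Same 26 code points as the long list above, written with ranges.
    // The doubled backslashes hand the \\uXXXX escapes to the regex engine.
    static final String WHITESPACE_CHARS =
          "\\u0009-\\u000D"   // TAB, LF, VT, FF, CR
        + "\\u0020"           // SPACE
        + "\\u0085"           // NEXT LINE (NEL)
        + "\\u00A0"           // NO-BREAK SPACE
        + "\\u1680"           // OGHAM SPACE MARK
        + "\\u180E"           // MONGOLIAN VOWEL SEPARATOR
        + "\\u2000-\\u200A"   // EN QUAD through HAIR SPACE
        + "\\u2028\\u2029"    // LINE SEPARATOR, PARAGRAPH SEPARATOR
        + "\\u202F\\u205F"    // NARROW NO-BREAK SPACE, MEDIUM MATHEMATICAL SPACE
        + "\\u3000";          // IDEOGRAPHIC SPACE

    // One or more of those characters in a row.
    static final Pattern WHITESPACE_RUN = Pattern.compile("[" + WHITESPACE_CHARS + "]+");

    // Hypothetical helper: replaces every run of whitespace with one plain space.
    static String collapse(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("abc \u00A0 def\u0085\u2003xyz")); // abc def xyz
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll would otherwise do.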

Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mind-numbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.

edited Jul 10, 2019 at 15:45

CommunityBot

answered Jan 19, 2011 at 2:16

tchrist

10 Comments

kritzikratzi Over a year ago

this is really old. is it correct that this was fixed in java7 with the UNICODE_CHARACTER_CLASS flag? (or using (?U))

Robert Tupelo-Schneck Over a year ago

A shorter way to rewrite \s is [\s\u0085\p{Z}].

beerbajay Over a year ago

@tchrist If this is fixed in java 7+, could you update the answer with the now-correct way to do this?

Didier A. Over a year ago

With Java 7+ you can use "(?U)\\s" to run the regex with Unicode Technical Standard conformance, or you can set the UNICODE_CHARACTER_CLASS flag when creating the pattern. Here's the doc: docs.oracle.com/javase/7/docs/api/java/util/regex/…

Jelle van Geuns Over a year ago

The above code is missing \\u200B (ZERO WIDTH SPACE)

Georg Plaz · Accepted Answer · 2020-05-31 09:06:23Z

47

Yeah, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);
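As a self-contained sketch of the same point (the class and method names here are made up for illustration), the question's pattern works once the return value is kept. Note that the input string itself is never modified:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureReplaceAllResult {
    // Hypothetical helper wrapping the question's pattern.
    static String collapsePairs(String modLine) {
        Pattern whitespace = Pattern.compile("\\s\\s");
        Matcher matcher = whitespace.matcher(modLine);
        // replaceAll() leaves modLine untouched and returns a new String,
        // so the result has to be returned or assigned.
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        String modLine = "a  b    c";
        String result = collapsePairs(modLine);
        System.out.println("[" + modLine + "]"); // [a  b    c]
        System.out.println("[" + result + "]");  // [a b  c]
    }
}
```

Because \\s\\s consumes exactly two characters per match, a run of four spaces becomes two; a pattern such as \\s+ or \\s{2,} collapses the whole run in one pass.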

edited May 31, 2020 at 9:06

Georg Plaz

answered Jan 19, 2011 at 2:02

Raph Levien

4 Comments

user372743 Over a year ago

Gah. I feel like the biggest idiot on earth. Neither I nor two other people seemed to notice that. I guess the stupidest little errors throw us off sometimes, eh?

saibharath Over a year ago

So true! I guess that happens with the best of them

Gilberto Ibarra Over a year ago

What should I do if I need to check whether the text contains white space?

Robert Over a year ago

Per my answer below, use \p{Zs} instead of \s if you want to match Unicode whitespace.

surfealokesea · Accepted Answer · 2013-06-11 16:11:53Z

18

For Java (not PHP, not JavaScript, not any other language):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
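A small sketch of what this class does and does not cover, assuming a made-up wrapper class JavaSpaceCharDemo. \p{javaSpaceChar} is backed by Character.isSpaceChar, which accepts the Unicode separator categories (Zs, Zl, Zp), so it catches NO-BREAK SPACE but not control characters such as tab:

```java
class JavaSpaceCharDemo {
    // Hypothetical helper: collapses runs of two or more "space characters"
    // (as defined by Character.isSpaceChar) into one plain space.
    static String collapse(String txt) {
        return txt.replaceAll("\\p{javaSpaceChar}{2,}", " ");
    }

    public static void main(String[] args) {
        // space + NO-BREAK SPACE + space is collapsed; the two tabs are left alone
        String txt = "a \u00A0 b\t\tc";
        String collapsed = collapse(txt);
        System.out.println(collapsed.equals("a b\t\tc")); // true
    }
}
```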

edited Jun 11, 2013 at 16:11

answered Jun 11, 2013 at 10:27

surfealokesea

4 Comments

Enwired Over a year ago

Strings are immutable, thus you have to assign the result to something, such as 'txt = txt.replaceAll()' I did not vote-down your answer, but that might be why someone else did so.

surfealokesea Over a year ago

I know replaceAll returns a string; the important thing for Java programmers is \\p{javaSpaceChar}.

Enwired Over a year ago

The original question made the mistake of not assigning the new string to a variable. Pointing out that mistake is thus the most important point of the answer.

Piko Over a year ago

This totally solved my problem in Groovy! Finally! Been trying every regex I could find that would match all white space including NON-BREAK-SPACE (ASCII 160)!!!

Robert · Accepted Answer · 2021-08-21 11:24:01Z

12

Java has evolved since this issue was first brought up. You can match all manner of Unicode space characters by using the \p{Zs} group.

Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:

String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");

Also worth knowing: if you've used the trim() string function, you should take a look at the (relatively new, added in Java 11) strip(), stripLeading(), and stripTrailing() functions on strings. They can help you trim off all sorts of squirrelly white space characters. For more information on which white space characters are included, see Java's Character.isWhitespace() function.
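A short sketch of the difference, assuming Java 11 or later and a made-up class name. trim() only removes characters up to U+0020, while strip() follows Character.isWhitespace(), which includes EM SPACE but deliberately excludes the non-breaking spaces such as U+00A0:

```java
class StripVersusTrim {
    // Thin wrappers so the two behaviours can be compared side by side.
    static String trimmed(String s)  { return s.trim(); }
    static String stripped(String s) { return s.strip(); }

    public static void main(String[] args) {
        String emSpaced = "\u2003hello\u2003"; // EM SPACE on both sides
        System.out.println(trimmed(emSpaced).length());  // 7, trim() ignores U+2003
        System.out.println(stripped(emSpaced).length()); // 5, strip() removes it

        String nbsp = "\u00A0hello";           // leading NO-BREAK SPACE
        System.out.println(stripped(nbsp).length());     // 6, strip() keeps U+00A0
    }
}
```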

edited Aug 21, 2021 at 11:24

answered Oct 24, 2019 at 11:43

Robert

3 Comments

Captain Man Over a year ago

As a heads up, this does not match newlines, but this does.

Robert Over a year ago

@CaptainMan the answer you reference leaves out a small note from the JavaDoc: "Specifying this flag may impose a performance penalty." To avoid that performance hit I would suggest \p{Zl} for line separators and \p{Zp} for paragraph separators.

Robert Over a year ago

Expanded: txt.replaceAll("(\\p{Zs}|\\p{Zl}|\\p{Zp})+", " "); to replace all sorts of separators with a single space character.

Wiktor Stribiżew · Accepted Answer · 2021-09-11 15:32:04Z

To match any whitespace character, you can use

Pattern whitespace = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

The Pattern.UNICODE_CHARACTER_CLASS option "enables the Unicode version of Predefined character classes and POSIX character classes" that are then "in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties".
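To make the effect of the flag concrete, here is a minimal comparison (class and method names are illustrative only) of the default \s and the flagged \s against a NO-BREAK SPACE:

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Default \s: only space, tab, LF, VT, FF and CR.
    static final Pattern DEFAULT_WS = Pattern.compile("\\s");
    // With the flag, \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesDefault(String s) { return DEFAULT_WS.matcher(s).matches(); }
    static boolean matchesUnicode(String s) { return UNICODE_WS.matcher(s).matches(); }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // NO-BREAK SPACE
        System.out.println(matchesDefault(nbsp)); // false
        System.out.println(matchesUnicode(nbsp)); // true
        System.out.println(matchesDefault(" "));  // true
    }
}
```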

The same behavior can also be enabled with the (?U) embedded flag expression. For example, if you want to replace/remove all Unicode whitespaces in Java with regex, you can use

String removed       = text.replaceAll("(?U)\\s+", "");     // removes all whitespace
String dashedEach    = text.replaceAll("(?U)\\s", "-");     // replaces each single whitespace character with -
String dashedRuns    = text.replaceAll("(?U)\\s+", "-");    // replaces each run of one or more whitespace characters with a single -
String dashedLeading = text.replaceAll("(?U)\\G\\s", "-");  // replaces each whitespace character at the start of the string with -

See the Java demo online:

String text = "\u00A0 \u00A0\tStart reading\u00A0here..."; // \u00A0 - non-breaking space
System.out.println("Text: '" + text + "'"); // => Text: '       Start reading here...'
System.out.println(text.replaceAll("(?U)\\s+", "")); // => Startreadinghere...
System.out.println(text.replaceAll("(?U)\\s", "-")); // => ----Start-reading-here...
System.out.println(text.replaceAll("(?U)\\s+", "-")); // => -Start-reading-here...
System.out.println(text.replaceAll("(?U)\\G\\s", "-")); // => ----Start reading here...

Mihai Toader · Accepted Answer · 2011-01-19 02:01:51Z

6

Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

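Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
// replaceAll() resets the matcher and scans the whole input itself, so no find() guard is needed;
// guarding with find() would leave result empty whenever the input has no match.
String result = matcher.replaceAll(" ");

System.out.println(result);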

answered Jan 19, 2011 at 2:01

Mihai Toader

Comments

Tuomas · Accepted Answer · 2014-11-03 12:01:11Z

6

When I sent a question to the RegexBuddy (regex development application) forum, I got a more exact reply to my \s Java question:

"Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."

answered Nov 3, 2014 at 12:01

Tuomas

1 Comment

Robert Tupelo-Schneck Over a year ago

[\s\p{Z}] omits the Unicode "next line" character U+0085. Use [\s\u0085\p{Z}].
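A hedged sketch of the [\s\u0085\p{Z}] class in use (the wrapper class is hypothetical). It combines the ASCII shorthand, NEXT LINE, and every Unicode separator category, without needing the (?U) flag:

```java
import java.util.regex.Pattern;

class ExtendedWhitespaceClass {
    // ASCII whitespace (\s) + NEXT LINE (U+0085) + all separators (\p{Z})
    static final Pattern WS_RUN = Pattern.compile("[\\s\\u0085\\p{Z}]+");

    // Hypothetical helper: replaces each run of such characters with one space.
    static String collapse(String input) {
        return WS_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // NEL, NO-BREAK SPACE and tab form a single run between "a" and "b"
        System.out.println(collapse("a\u0085\u00A0\tb")); // a b
    }
}
```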

Mike · Accepted Answer · 2011-09-15 12:51:05Z

3

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(modLine);

// Keep collapsing pairs until no two consecutive whitespace characters remain
while (matcher.find())
{
    // Update your original search text with the result of the replace
    modLine = matcher.replaceAll(" ");
    // Reset the matcher to look at this "new" text, then search again
    matcher = whitespace.matcher(modLine);
}

answered Sep 15, 2011 at 12:51

Mike

2 Comments

user372743 Over a year ago

Mike, while I appreciate you taking the time to answer, this question has been solved several months ago. There is no need to answer questions as old as this.

james.garriss Over a year ago

If someone can show a different, better solution, answering old questions is perfectly legit.

Jesper · Accepted Answer · 2020-01-29 10:00:22Z

3

For your purpose you can use this snippet:

import org.apache.commons.lang3.StringUtils;

String normalized = StringUtils.normalizeSpace(string); // the result is returned, not applied in place

This collapses each run of whitespace into a single space and also strips leading and trailing whitespace. With plain String methods, you can do something similar:

String sampleString = "Hello    world!";
String pairsCollapsed = sampleString.replaceAll("\\s{2}", " ");  // replaces each pair of consecutive whitespace characters
String runsCollapsed  = sampleString.replaceAll("\\s{2,}", " "); // replaces each run of two or more whitespace characters

edited Jan 29, 2020 at 10:00

Jesper

answered May 18, 2018 at 19:42

Rashid Mv

Comments

Bokili Production · Accepted Answer · 2022-11-07 20:22:10Z

0

You can use something simpler:

String out = in.replaceAll(" {2}", " ");

answered Nov 7, 2022 at 20:22

Bokili Production

Comments

Manidip Sengupta · Accepted Answer · 2011-01-19 04:10:51Z

Using whitespace in regular expressions is a pain, but I believe it works. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use a regex (uncomment the println() to view how the matcher is breaking up the String), here is some sample code:

import java.util.regex.*;

public class Two21WS {
    private String  str = "";
    private Pattern pattern = Pattern.compile ("\\s{2,}");  // multiple spaces

    public Two21WS (String s) {
            StringBuffer sb = new StringBuffer();
            Matcher matcher = pattern.matcher (s);
            int startNext = 0;
            while (matcher.find (startNext)) {
                    if (startNext == 0)
                            sb.append (s.substring (0, matcher.start()));
                    else
                            sb.append (s.substring (startNext, matcher.start()));
                    sb.append (" ");
                    startNext = matcher.end();
                    //System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
                    //                      ", sb: \"" + sb.toString() + "\"");
            }
            sb.append (s.substring (startNext));
            str = sb.toString();
    }

    public String toString () {
            return str;
    }

    public static void main (String[] args) {
            String tester = " a    b      cdef     gh  ij   kl";
            System.out.println ("Initial: \"" + tester + "\"");
            System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
    }
}

It produces the following (compile with javac and run at the command prompt):

% java Two21WS
Initial: " a    b      cdef     gh  ij   kl"
Two21WS: " a b cdef gh ij kl"
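For comparison, the same manual find-and-append loop can be sketched with Matcher.appendReplacement() and Matcher.appendTail(), which do the substring bookkeeping themselves. This is an alternative technique, not the answer's own code, and the class name is made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    private static final Pattern MULTI_WS = Pattern.compile("\\s{2,}"); // runs of 2+ whitespace

    static String collapse(String s) {
        Matcher matcher = MULTI_WS.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Copies the text since the previous match, then the replacement
            matcher.appendReplacement(sb, " ");
        }
        // Copies whatever follows the last match
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String tester = " a    b      cdef     gh  ij   kl";
        System.out.println("\"" + collapse(tester) + "\""); // " a b cdef gh ij kl"
    }
}
```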

WTF!? Why would you want to do all that when you can just call replaceAll() instead?
