
I have this string:

mystring = 'Here is  some   text   I      wrote   '

How can I substitute the double, triple (...) whitespace characters with a single space, so that I get:

mystring = 'Here is some text I wrote'
    You should probably say 'substitute multiple whitespace with a single space' since whitespace is a class of characters (tabs, newlines etc.) Commented Jan 16, 2010 at 16:15

3 Answers


A simple possibility (if you'd rather avoid REs) is

' '.join(mystring.split())

The split and join perform the task you're explicitly asking about -- plus, they also do the extra one that you don't talk about but is seen in your example, removing trailing spaces;-).
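As a minimal sketch of the split/join approach above (the helper name `collapse_whitespace` is only illustrative, not part of the original answer), note that it treats tabs and newlines the same way as spaces:

```python
def collapse_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    # str.split() with no argument splits on runs of any whitespace
    # and discards empty strings at the start and end.
    return " ".join(text.split())


mystring = "Here is  some   text   I      wrote   "
print(repr(collapse_whitespace(mystring)))
print(repr(collapse_whitespace("tabs\tand\nnewlines  too")))
```

If newlines or tabs need to survive, this approach is not suitable as written, since `split()` does not distinguish between kinds of whitespace.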


9 Comments

Oh cool, I was fumbling with a similar solution, but using split(' ') and then a filter to remove empty elements. I never knew split with no arguments worked like this. This is also much faster, timeit.py gives me around 0.74usec for this, versus 5.75usec for regular expressions.
@Roman, yes, x.split() (and x.split(None)) splits on sequences of whitespace (including tabs, newlines, etc, like re's \s) of length 1+ -- and it's pretty fast indeed. So, always glad to help!
this is a very elegant solution, but I want to mention that this will also remove any linebreaks as well
To avoid '\n' from being mixed with ' ' one can use splitlines() like this: ' '.join((''.join(text.splitlines())).split())
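To collapse only consecutive spaces, one can use ' '.join(filter(None, mystring.split(' '))). This also removes the leading and trailing spaces but keeps newlines, tabs, etc. (Without the filter, split(' ') yields empty strings between adjacent spaces, so joining them simply reproduces the original string.)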

A regular expression can be used to offer more control over the whitespace characters that are combined.

To match unicode whitespace:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()

To match ASCII whitespace only:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)

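Matching only ASCII whitespace is sometimes essential for keeping control characters such as \x1c, \x1d, \x1e and \x1f, which the Unicode \s also matches. (Note that \x0b and \x0c are \v and \f, so \s matches them in both modes.)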
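To illustrate the difference between the two modes, here is a small sketch. The sample string and the pattern names `_UNICODE_WS` and `_ASCII_WS` are made up for this example; the patterns themselves are the ones shown above.

```python
import re

_UNICODE_WS = re.compile(r"\s+")       # Unicode whitespace (default for str patterns)
_ASCII_WS = re.compile(r"(?a:\s+)")    # only [ \t\n\r\f\v]

# Two non-breaking spaces, one \x1c control character, two plain spaces.
sample = "a\u00a0\u00a0b\x1cc  d"

unicode_result = _UNICODE_WS.sub(" ", sample)
ascii_result = _ASCII_WS.sub(" ", sample)

print(repr(unicode_result))  # non-breaking spaces and \x1c are replaced
print(repr(ascii_result))    # only the run of plain spaces is collapsed
```

Which behaviour is wanted depends on the data: the Unicode form normalises typographic spaces, while the ASCII form leaves them (and the \x1c to \x1f separators) untouched.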

Reference:

About \s:

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

About re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

strip() will remove any leading and trailing whitespace.

2 Comments

If you really only want to replace spaces (' '), use re.sub(' +', ' ', mystring).strip()
I don't really understand what's happening with this question, which appears no longer to be accepting answers. The most obvious answer of all (as put forward in a now deleted question from 2010) is re.sub(r'\s+', ' ', mystring) ... this succinctly does what the OP asked (to be fair, the question is not well expressed: you can't "replace with a whitespace", but you can "replace with a space").

For completeness, you can also use:

# The while loop below would leave a single trailing space,
# so strip the surrounding whitespace before (or after) the loop.
mystring = mystring.strip()
while '  ' in mystring:
    mystring = mystring.replace('  ', ' ')

which will work quickly on strings with relatively few spaces (faster than re in these situations).

In any scenario, Alex Martelli's split/join solution performs at least as quickly (usually significantly more so).

In your example, using the default values of timeit.Timer.repeat(), I get the following times:

str.replace: [1.4317800167340238, 1.4174888149192384, 1.4163512401715934]
re.sub:      [3.741931446594549,  3.8389395858970374, 3.973777672860706]
split/join:  [0.6530919432498195, 0.6252146571700905, 0.6346594329726258]
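A comparison along these lines can be reproduced with a sketch like the following. The function names are hypothetical, and the script uses a smaller `number` than the `timeit` default so that it finishes quickly; absolute timings will differ by machine and Python version, so none are quoted here.

```python
import re
import timeit

mystring = "Here is  some   text   I      wrote   "
_WS = re.compile(r"\s+")


def with_replace(s: str) -> str:
    # Repeatedly halve runs of spaces until no double space remains.
    s = s.strip()
    while "  " in s:
        s = s.replace("  ", " ")
    return s


def with_re(s: str) -> str:
    return _WS.sub(" ", s).strip()


def with_split_join(s: str) -> str:
    return " ".join(s.split())


for fn in (with_replace, with_re, with_split_join):
    # Assumption: 100,000 calls per run, 3 runs; report the best run.
    times = timeit.repeat(lambda: fn(mystring), number=100_000, repeat=3)
    print(f"{fn.__name__}: {min(times):.4f}s")
```

All three functions return the same result for this input; they differ on strings containing tabs or newlines, where only `with_replace` leaves those characters in place.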


EDIT:

Just came across this post which provides a rather long comparison of the speeds of these methods.

2 Comments

More lines than the others, and thus less "pythonic", but clearer.
A reminder: this one risks becoming an infinite loop if you make a typo.
