Substitute multiple whitespace with single whitespace in Python [duplicate]

Question

I have this string:

mystring = 'Here is  some   text   I      wrote   '

How can I substitute the double, triple (...) whitespace characters with a single space, so that I get:

mystring = 'Here is some text I wrote'

You should probably say 'substitute multiple whitespace with a single space' since whitespace is a class of characters (tabs, newlines etc.)
– Noufal Ibrahim, Commented Jan 16, 2010 at 16:15

Alex Martelli · Accepted Answer · 2010-01-16 15:54:24Z

1079

A simple possibility (if you'd rather avoid REs) is

' '.join(mystring.split())

The split and join perform the task you're explicitly asking about -- plus, they also do the extra one that you don't talk about but is seen in your example, removing trailing spaces;-).
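A quick sketch of that behaviour, using a made-up sample string (`messy` is only for illustration): with no arguments, `split()` treats any run of whitespace as one separator and discards leading and trailing whitespace, so the join leaves single spaces only.

```python
# Hypothetical sample mixing spaces, a tab and a newline.
messy = '  Here is  some\ttext \n I   wrote   '

# split() with no arguments splits on runs of any whitespace and
# produces no empty strings at either end.
words = messy.split()
cleaned = ' '.join(words)

print(words)    # ['Here', 'is', 'some', 'text', 'I', 'wrote']
print(cleaned)  # Here is some text I wrote
```

Note that the tab and the newline are replaced too, which may or may not be what you want.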

answered Jan 16, 2010 at 15:54

Alex Martelli


9 Comments

Roman Over a year ago

Oh cool, I was fumbling with a similar solution, but using split(' ') and then a filter to remove empty elements. I never knew split with no arguments worked like this. This is also much faster, timeit.py gives me around 0.74usec for this, versus 5.75usec for regular expressions.

Alex Martelli Over a year ago

@Roman, yes, x.split() (and x.split(None)) splits on sequences of whitespace (including tabs, newlines, etc, like re's \s) of length 1+ -- and it's pretty fast indeed. So, always glad to help!

trudolf Over a year ago

this is a very elegant solution, but I want to mention that this will also remove any linebreaks as well

Pradeep Singh Over a year ago

To avoid '\n' from being mixed with ' ' one can use splitlines() like this: ' '.join((''.join(text.splitlines())).split())

FifthAxiom Over a year ago

To only strip consecutive repeated spaces one can use ' '.join(filter(None, mystring.split(' '))). This will also remove the leading and trailing spaces but will keep newlines, tabs, etc.
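One caveat worth checking: `split(' ')` with an explicit separator keeps an empty string for every extra space, so joining its result unchanged simply reproduces the input. A minimal sketch of a spaces-only variant that drops those empty strings (the sample string is made up for illustration):

```python
# Hypothetical sample: repeated spaces plus a tab and a newline to preserve.
text = '  a  b\t c \n  d  '

# Joining the raw result of split(' ') gives back the original string.
unchanged = ' '.join(text.split(' '))

# Dropping the empty pieces collapses runs of spaces and trims them at
# both ends, while tabs and newlines stay inside the remaining pieces.
pieces = [piece for piece in text.split(' ') if piece]
collapsed = ' '.join(pieces)

print(unchanged == text)  # True
print(repr(collapsed))    # 'a b\t c \n d'
```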


jaumebonet · Answer · 2020-03-11 09:14:20Z

203

A regular expression can be used to offer more control over the whitespace characters that are combined.

To match Unicode whitespace:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()

To match ASCII whitespace only:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)

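Matching only ASCII whitespace is sometimes essential for keeping control characters such as \x1c, \x1d, \x1e and \x1f, which a Unicode-aware \s treats as whitespace. (\x0b and \x0c are \v and \f, so they are still matched in ASCII mode.)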
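To see the difference concretely, here is a small sketch with a sample string (an assumption chosen for illustration) containing a non-breaking space `\xa0` and the control character `\x1c`. It uses the `re.ASCII` flag rather than the inline `(?a:...)` form; the two should behave the same here.

```python
import re

# Hypothetical sample: non-breaking spaces (\xa0) and file separators (\x1c).
sample = "a\xa0\xa0b\x1c\x1cc  d"

unicode_ws = re.compile(r"\s+")          # Unicode-aware for str patterns
ascii_ws = re.compile(r"\s+", re.ASCII)  # only [ \t\n\r\f\v]

print(repr(unicode_ws.sub(" ", sample)))  # 'a b c d'
print(repr(ascii_ws.sub(" ", sample)))    # 'a\xa0\xa0b\x1c\x1cc d'
```

Only the run of plain spaces is collapsed in ASCII mode; the other characters pass through untouched.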

Reference:

About \s:

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

About re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

strip() will remove any leading and trailing whitespace.

edited Mar 11, 2020 at 9:14

jaumebonet


answered Jan 16, 2010 at 15:46

hroest


2 Comments

Simon Hessner Over a year ago

If you really only want to replace spaces (' '), use re.sub(' +', ' ', mystring).strip()

mike rodent Feb 16 at 15:01

I don't really understand what's happening with this question, which appears no longer to be accepting answers. The most obvious answer of all (as put forward in a now deleted question from 2010) is re.sub(r'\s+', ' ', mystring) ... this succinctly does what the OP asked (to be fair, the question is not well expressed: you can't "replace with a whitespace", but you can "replace with a space").

Community · Answer · 2017-05-23 11:47:21Z

49

For completeness, you can also use:

# The while loop alone would leave a trailing space, so the trailing
# whitespace must be dealt with before or after the loop.
mystring = mystring.strip()
while '  ' in mystring:
    mystring = mystring.replace('  ', ' ')

which will work quickly on strings with relatively few spaces (faster than re in these situations).

In any scenario, Alex Martelli's split/join solution is at least as fast (usually significantly faster).

In your example, using the default values of timeit.Timer.repeat(), I get the following times:

str.replace: [1.4317800167340238, 1.4174888149192384, 1.4163512401715934]
re.sub:      [3.741931446594549,  3.8389395858970374, 3.973777672860706]
split/join:  [0.6530919432498195, 0.6252146571700905, 0.6346594329726258]
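A rough sketch of how such a comparison could be set up with `timeit`. The helper function names are invented here, `number` is lowered from the default to keep the run short, and absolute timings will differ by machine and Python version:

```python
import re
import timeit

mystring = 'Here is  some   text   I      wrote   '

def with_replace(s):
    # Strip first, then halve runs of spaces until none remain.
    s = s.strip()
    while '  ' in s:
        s = s.replace('  ', ' ')
    return s

def with_re(s):
    return re.sub(r'\s+', ' ', s).strip()

def with_split_join(s):
    return ' '.join(s.split())

for func in (with_replace, with_re, with_split_join):
    timer = timeit.Timer(lambda: func(mystring))
    # Each entry is the total time in seconds for `number` calls.
    print(func.__name__, timer.repeat(repeat=3, number=10_000))
```

All three helpers return the same string for this input, so the comparison is like for like.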

EDIT:

Just came across this post which provides a rather long comparison of the speeds of these methods.

edited May 23, 2017 at 11:47

Community Bot


answered Apr 10, 2013 at 22:23

David C


2 Comments

BuvinJ Over a year ago

More lines than the others, and thus less "pythonic", but clearer.

林果皞 Over a year ago

A reminder: this one risks becoming an infinite loop if you make a typo.
