Remove whitespace in Python using string.whitespace

Question

Python's string.whitespace is great:

>>> string.whitespace
'\t\n\x0b\x0c\r '

How do I use this with a string without resorting to manually typing in '\t|\n|... etc for regex?

For example, it should be able to turn: "Please \n don't \t hurt \x0b me."

into

"Please don't hurt me."

I'd probably want to keep the single spaces, but it'd be easy enough to just go string.whitespace[:-1] I suppose.

bobince · Accepted Answer · 2009-12-14 03:59:41Z

148

There is a special-case shortcut for exactly this use case!

If you call str.split without an argument, it splits on runs of whitespace instead of single characters. So:

>>> ' '.join("Please \n don't \t hurt \x0b me.".split())
"Please don't hurt me."

answered Dec 14, 2009 at 3:59
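For anyone who wants this as a reusable helper, here is a minimal sketch of the split/join idiom, assuming Python 3. The function name `collapse_whitespace` is made up for illustration and is not part of the standard library.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    # split() with no argument splits on runs of whitespace and discards
    # leading/trailing whitespace, so join() only ever sees the words.
    return " ".join(text.split())


print(collapse_whitespace("Please \n don't \t hurt \x0b me."))
print(repr(collapse_whitespace("  leading and trailing\n")))
```

Note that leading and trailing whitespace disappears entirely, which may or may not be what you want.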


7 Comments

Tor Valamo Over a year ago

That is infinitely better than my solution. I also hope to become immortal one day.

Alex Over a year ago

Wow. That is amazing. Perfect for what I'm doing, since they're small strings. I wonder how this would perform on large datasets though? It'd be great if anyone knows how it works intrinsically :)

MattoTodd Over a year ago

thanks, didn't know about using no argument for runs of whitespace. Huge!!

Kos Over a year ago

This is still faster than regex for a 20MB string.

bobince Over a year ago

@Dominique: yes, it's a documented stdlib feature—“If sep is not specified or is None, a different splitting algorithm is applied...”—that's widely used and not likely to be deprecated.


Imran · Accepted Answer · 2009-12-14 04:29:39Z

14

What's wrong with the \s character class?

>>> import re

>>> pattern = re.compile(r'\s+')
>>> re.sub(pattern, ' ', "Please \n don't \t hurt \x0b me.")
"Please don't hurt me."

answered Dec 14, 2009 at 4:29
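A small sketch of the same idea with the pattern compiled once and reused, assuming Python 3. The names `WHITESPACE_RUN` and `collapse_with_regex` are illustrative only.

```python
import re

# Compile once and reuse the pattern object for every call.
WHITESPACE_RUN = re.compile(r"\s+")


def collapse_with_regex(text):
    """Replace each run of whitespace with a single space."""
    return WHITESPACE_RUN.sub(" ", text)


print(collapse_with_regex("Please \n don't \t hurt \x0b me."))
print(repr(collapse_with_regex("  a\n")))
```

One difference from the join/split version: a leading or trailing run of whitespace is turned into a single space here rather than removed, so add a `.strip()` if you need the ends trimmed.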


3 Comments

Alex Over a year ago

Nothing, good solution. I think the .join/split option is pretty neat though, don't you think? :)

Imran Over a year ago

Indeed. In fact, timeit shows join/split to be 6 times faster than re.sub() for your given string.

Christophe Roussy Over a year ago

I suppose once compiled and sub reused multiple times this could be fast too
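If you want to check the timing claims yourself, something like the following rough sketch should do, assuming Python 3. The helper names and the repeat count are arbitrary choices, and the numbers will vary with machine, Python version, and input size, so treat any single run as indicative only.

```python
import re
import timeit

SAMPLE = "Please \n don't \t hurt \x0b me."
PATTERN = re.compile(r"\s+")  # precompiled, as suggested in the comment above


def with_split(text):
    return " ".join(text.split())


def with_regex(text):
    return PATTERN.sub(" ", text)


# Time each approach on the same input; only the relative order matters.
for func in (with_split, with_regex):
    seconds = timeit.timeit(lambda: func(SAMPLE), number=10_000)
    print(f"{func.__name__}: {seconds:.4f}s for 10,000 calls")
```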

John Machin · Accepted Answer · 2009-12-14 20:26:51Z

9

Let's make some reasonable assumptions:

(1) You really want to replace any run of whitespace characters with a single space (a run is of length 1 or greater).

(2) You would like the same code to work with minimal changes under Python 2.X with unicode objects.

(3) You don't want your code to assume things that are not guaranteed in the docs.

(4) You would like the same code to work with minimal changes with Python 3.X str objects.

The currently selected answer has these problems:

(a) changes " " * 3 to " " * 2, i.e. it removes duplicate spaces but not triplicate, quadruplicate, etc. spaces. [fails requirement 1]

(b) changes "foo\tbar\tzot" to "foobarzot" [fails requirement 1]

(c) when fed a unicode object, gets TypeError: translate() takes exactly one argument (2 given) [fails requirement 2]

(d) uses string.whitespace[:-1] [fails requirement 3; order of characters in string.whitespace is not guaranteed]

(e) uses string.whitespace[:-1] [fails requirement 4; in Python 2.X, string.whitespace is '\t\n\x0b\x0c\r '; in Python 3.X, it is ' \t\n\r\x0b\x0c']

The " ".join(s.split()) answer and the re.sub(r"\s+", " ", s) answer don't have these problems.
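Points (a), (b), (d) and (e) are easy to reproduce. The snippet below is a quick illustration in Python 3, not the original answer's code, and the variable names are made up for the demo.

```python
import string

# (a) A single replace() pass only shortens a run of spaces; it does not
# reduce every run to one space.
halved = "a   b".replace("  ", " ")   # three spaces in
print(repr(halved))

# (b) Deleting tabs outright glues the neighbouring words together.
glued = "foo\tbar\tzot".replace("\t", "")
print(glued)

# (d)/(e) Pick characters by value instead of by position, so the result
# does not depend on how string.whitespace happens to be ordered.
non_space = string.whitespace.replace(" ", "")
print(repr(non_space))
```

Selecting by value avoids the slice altogether, which sidesteps the ordering difference between Python 2.X and 3.X described in (e).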

edited Dec 14, 2009 at 20:26

answered Dec 14, 2009 at 10:09


1 Comment

Alex Over a year ago

Hey, you raise some great points. For me, the ' '.join(s.split()) works on the "foo\tbar\tzot" test! I mean, the original answer worked for me, but that's only because I'm not expecting such weird strings. However something that deals with this would be great. I just tested the sub with "foo\tbar\tzot" and it works... so I guess I'm just choosing the ' '.join(s.split()) version due to its simplicity and being able to work without importing the re module. Also my datasets are small, so I'm not worried about performance issues, if there were any.

Tor Valamo · Accepted Answer · 2009-12-14 03:18:34Z

2

You could use the translate method

import string

s = "Please \n don't \t hurt \x0b me."
s = s.translate(None, string.whitespace[:-1]) # python 2.6 and up
s = s.translate(string.maketrans('',''), string.whitespace[:-1]) # python 2.5, dunno further down
>>> s
"Please  don't  hurt  me."

And then remove duplicate whitespace

s = s.replace('  ', ' ')
>>> s
"Please don't hurt me."

edited Dec 14, 2009 at 3:18

answered Dec 14, 2009 at 2:58
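The code above is Python 2 only. A rough Python 3 equivalent might look like the sketch below, where `str.maketrans` takes the characters to delete as its third argument. The names `NON_SPACE_WS`, `DELETE_TABLE` and `strip_then_collapse` are hypothetical, and the loop is an addition so that runs longer than two spaces are also collapsed.

```python
import string

# Every whitespace character except the plain space, chosen by value
# rather than by slicing string.whitespace.
NON_SPACE_WS = string.whitespace.replace(" ", "")

# Python 3: the third argument to str.maketrans lists characters to delete.
DELETE_TABLE = str.maketrans("", "", NON_SPACE_WS)


def strip_then_collapse(text):
    stripped = text.translate(DELETE_TABLE)
    # One replace() pass is not enough for runs of three or more spaces,
    # so repeat until no double space is left.
    while "  " in stripped:
        stripped = stripped.replace("  ", " ")
    return stripped


print(strip_then_collapse("Please \n don't \t hurt \x0b me."))
```

This still deletes tabs and newlines instead of replacing them, so "foo\tbar" comes out as "foobar".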


3 Comments

Tor Valamo Over a year ago

see the edit. also, which python version are you using? you need 2.6 for the None argument to work.

Alex Over a year ago

Yeah, I'm using 2.5... is there an alternative for None? Otherwise I'll have to use the other answer...

Alex Over a year ago

Nice, thanks very much! This is the best answer now, especially since it caters for my 2.5-ness.

miku · Accepted Answer · 2009-12-14 03:03:21Z

1

A starting point (although it's not shorter than manually assembling the whitespace circus):

>>> from string import whitespace as ws
>>> import re

>>> p = re.compile('(%s)' % ('|'.join([c for c in ws])))
>>> s = "Please \n don't \t hurt \x0b me."

>>> p.sub('', s)
"Pleasedon'thurtme."

Or if you want to reduce whitespace to a maximum of one:

>>> p1 = re.compile('(%s)' % ('|'.join([c for c in ws if not c == ' '])))
>>> p2 = re.compile(' +')
>>> s = "Please \n don't \t hurt \x0b me."

>>> p2.sub(' ', p1.sub('', s))
"Please don't hurt me."

Third way, more compact:

>>> import string

>>> s = "Please \n don't \t hurt \x0b me."
>>> s.translate(None, string.whitespace)
"Pleasedon'thurtme."

>>> s.translate(None, string.whitespace[:5])
"Please  don't  hurt  me."

>>> ' '.join(s.translate(None, string.whitespace[:5]).split())
"Please don't hurt me."

edited Dec 14, 2009 at 3:03

answered Dec 14, 2009 at 2:49
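If the goal is specifically to build the regex from string.whitespace without typing the alternatives by hand, a character class is probably simpler than joining with '|'. A minimal sketch, assuming Python 3; `WS_CLASS` and `collapse_ascii_whitespace` are illustrative names.

```python
import re
import string

# re.escape keeps every character literal inside the class, and the
# trailing + makes the pattern match whole runs rather than single chars.
WS_CLASS = re.compile("[" + re.escape(string.whitespace) + "]+")


def collapse_ascii_whitespace(text):
    """Replace each run of string.whitespace characters with one space."""
    return WS_CLASS.sub(" ", text)


print(collapse_ascii_whitespace("Please \n don't \t hurt \x0b me."))
```

Unlike `\s` on a Python 3 str, this only matches the six ASCII characters in string.whitespace, so something like a non-breaking space is left alone.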


1 Comment

Alex Over a year ago

I originally had this as the first answer; it was a nice solution and good use of python simplicity :)
