Python regex splitting on multiple whitespaces

Question

I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.

So for instance the text may be:

hello world this is John. or

hello world this is John or even

hello world, this, is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

Problem is I don't know if the user will use commas or whitespaces. Therefore I need a solution to cover it all.
– stratis, Commented Apr 29, 2014 at 10:25
My bad, didn't see the commas. The title is kind of misleading. Have you looked into re.split? Where is your current attempt failing?
– Robin, Commented Apr 29, 2014 at 10:28

Mr. Polywhirl · Accepted Answer · 2014-04-29 10:30:43Z

5

Use the regular expression r'[\s,]+' to split on one or more whitespace characters (\s) or commas (,).

import re

s = 'hello world,    this, is       John'
print(re.split(r'[\s,]+', s))

['hello', 'world', 'this', 'is', 'John']
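One caveat worth knowing: if the input starts or ends with a delimiter, re.split returns empty strings at the edges. Below is a minimal sketch of one way to handle that, assuming Python 3. The helper name split_words is hypothetical and not part of the original answer.

```python
import re

def split_words(text):
    """Split on runs of whitespace and/or commas, dropping empty pieces."""
    # re.split yields '' when the string starts or ends with a delimiter,
    # so those empty pieces are filtered out here.
    return [word for word in re.split(r'[\s,]+', text) if word]

print(re.split(r'[\s,]+', ' hello world, '))  # ['', 'hello', 'world', '']
print(split_words(' hello world, '))          # ['hello', 'world']
```

Calling text.strip() is not enough on its own here, because it removes surrounding whitespace but would leave a leading or trailing comma in place.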

edited Apr 29, 2014 at 10:30

answered Apr 29, 2014 at 10:24

Mr. Polywhirl


Community · Accepted Answer · 2020-06-20 09:12:55Z

3

Since you need to split on spaces and other special characters, the best regex would be \W+. Quoting from the Python re documentation:

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For example:

import re

data = "hello world,    this, is       John"
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']

Or, if you know the exact set of characters the string has to be split on, you can list them in a character class:

print(re.split(r"[\s,]+", data))

This splits on any run of whitespace characters (\s) and commas (,).
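The two patterns behave differently once the input contains other punctuation, such as the trailing period in one of the question's examples. Here is a small comparison, assuming Python 3; the variable names are only for illustration.

```python
import re

data = "hello world, this is John."

# \W+ treats the final period as a delimiter, which leaves a trailing
# empty string in the result.
by_nonword = re.split(r"\W+", data)

# [\s,]+ splits only on whitespace and commas, so the period stays
# attached to the last word.
by_space_comma = re.split(r"[\s,]+", data)

# Matching the words directly avoids empty strings altogether.
words = re.findall(r"\w+", data)

print(by_nonword)      # ['hello', 'world', 'this', 'is', 'John', '']
print(by_space_comma)  # ['hello', 'world', 'this', 'is', 'John.']
print(words)           # ['hello', 'world', 'this', 'is', 'John']
```

Note that \W+ and \w+ also break words at apostrophes and hyphens (for example, "don't" becomes "don" and "t"), so which pattern fits depends on what the user input is expected to contain.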

edited Jun 20, 2020 at 9:12

Community Bot

answered Apr 29, 2014 at 10:26

thefourtheye


3 Comments

stratis Over a year ago

Thank you. Clean and effective solution. However, note that only re.split("[\s,]+", data) worked. Maybe it's the fact that I'm under Windows.

stratis Over a year ago

Yes. The \W+ method returned an empty list for me. However, the re.split method worked perfectly well.

thefourtheye Over a year ago

@Konos5 I actually tested it before posting here. So, if you could help me reproduce the problem with some sample data, it would be good :)

Guestar · Accepted Answer · 2014-04-29 10:35:11Z

1

>>> s = "hello      world this     is            John"
>>> s.split()
['hello', 'world', 'this', 'is', 'John']
>>> s = "hello world, this, is John"
>>> s.split()
['hello', 'world,', 'this,', 'is', 'John']

The first one is correctly parsed by split with no arguments ;)

For the second one, you can strip the trailing comma from each piece:

>>> s = "hello world, this, is John"
>>> def notcoma(ss):
...     # Drop a single trailing comma, if present.
...     if ss[-1] == ',':
...         return ss[:-1]
...     else:
...         return ss
...
>>> list(map(notcoma, s.split()))
['hello', 'world', 'this', 'is', 'John']
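A shorter way to get the same result without a helper function or a regex is sketched below, assuming Python 3 and that commas act purely as separators.

```python
s = "hello world, this, is John"

# Turn every comma into a space, then let split() collapse whitespace runs.
words = s.replace(",", " ").split()
print(words)  # ['hello', 'world', 'this', 'is', 'John']

# Alternative: split on whitespace first, then strip commas from the ends
# of each piece and drop pieces that consisted only of commas.
stripped = [w.strip(",") for w in s.split() if w.strip(",")]
print(stripped)  # ['hello', 'world', 'this', 'is', 'John']
```

The two variants differ when a comma has no space next to it: the replace-based version splits "a,b" into two words, while the strip-based version keeps it as one.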

edited Apr 29, 2014 at 10:35

answered Apr 29, 2014 at 10:27

Guestar


1 Comment

thefourtheye Over a year ago

He has to split based on special characters as well
