4

I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.

So for instance the text may be:

hello world this is John. or

hello world this is John or even

hello world, this, is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

4
  • Tried r'/\s+/g' yet? Commented Apr 29, 2014 at 10:22
  • possible duplicate of Split string on whitespace in python Commented Apr 29, 2014 at 10:23
  • Problem is I don't know if the user will use commas or whitespaces. Therefore I need a solution to cover it all. Commented Apr 29, 2014 at 10:25
  • My bad, didn't see the commas. The title is kind of misleading. Have you looked into re.split? Where is your current attempt failing? Commented Apr 29, 2014 at 10:28

3 Answers

5

Use the regular expression r'[\s,]+' to split on one or more whitespace characters (\s) or commas (,).

import re

s = 'hello world,    this, is       John'
print(re.split(r'[\s,]+', s))

['hello', 'world', 'this', 'is', 'John']
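One caveat worth sketching: if the input starts or ends with a delimiter, re.split returns empty strings at the ends of the list. The helper name split_words below is hypothetical, chosen only for illustration; it filters those empty pieces out. Matching the words directly with re.findall is an alternative that avoids the problem.

```python
import re

def split_words(text):
    """Split on runs of whitespace and/or commas, dropping empty pieces.

    split_words is a hypothetical helper name used for this sketch.
    """
    # re.split yields '' at either end when the text starts or ends
    # with a delimiter, so filter those out.
    return [word for word in re.split(r'[\s,]+', text) if word]

messy = ' hello world, this, is John, '

print(re.split(r'[\s,]+', messy))
# ['', 'hello', 'world', 'this', 'is', 'John', '']

print(split_words(messy))
# ['hello', 'world', 'this', 'is', 'John']

# Alternative: match runs of non-delimiter characters instead of splitting.
print(re.findall(r'[^\s,]+', messy))
# ['hello', 'world', 'this', 'is', 'John']
```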


3

Since you need to split based on spaces and other special characters, the best regex would be \W+. Quoting from the Python re documentation:

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For example,

import re

data = "hello world,    this, is       John"
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']

Or, if you know the list of special characters on which the string has to be split, you can do:

print(re.split(r"[\s,]+", data))

This splits based on any whitespace character (\s) and comma (,).
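The two patterns differ once the input contains punctuation inside words. As a small sketch (the sample sentence is made up for illustration), \W+ also splits on apostrophes and hyphens, while [\s,]+ splits only on whitespace and commas:

```python
import re

# Made-up sample input containing an apostrophe and a hyphen.
data = "hello world, this is John's e-mail"

# \W+ treats every non-word character as a delimiter.
by_non_word = re.split(r"\W+", data)
print(by_non_word)
# ['hello', 'world', 'this', 'is', 'John', 's', 'e', 'mail']

# [\s,]+ only splits on whitespace and commas.
by_space_comma = re.split(r"[\s,]+", data)
print(by_space_comma)
# ['hello', 'world', 'this', 'is', "John's", 'e-mail']
```

Which behaviour is preferable depends on whether characters such as ' and - should be kept as part of a word.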

3 Comments

Thank you. Clean and effective solution. However, note that only print re.split("[\s,]+", data) worked. Maybe it's because I'm on Windows.
Yes. The \W+ method returned an empty list for me. However, the re.split method worked perfectly well.
@Konos5 I actually tested it before posting here. So, if you could help me reproduce the problem with some sample data, it would be good :)
1
>>> s = "hello      world this     is            John"
>>> s.split()
['hello', 'world', 'this', 'is', 'John']
>>> s = "hello world, this, is John"
>>> s.split()
['hello', 'world,', 'this,', 'is', 'John']

The first one is correctly parsed by split with no arguments ;)

Then you can strip the trailing commas:

>>> s = "hello world, this, is John"
>>> def notcoma(ss):
...     # Drop a single trailing comma, if there is one
...     if ss.endswith(','):
...         return ss[:-1]
...     else:
...         return ss
...
>>> list(map(notcoma, s.split()))
['hello', 'world', 'this', 'is', 'John']
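A shorter regex-free variant, offered only as a sketch: replace every comma with a space first, then let split() with no arguments collapse the whitespace. The function name split_words_no_regex is hypothetical. Unlike stripping a trailing comma from each token, this also handles commas that have no space after them.

```python
def split_words_no_regex(text):
    """Split on commas and whitespace without using the re module.

    split_words_no_regex is a hypothetical name used for this sketch.
    """
    # Turn commas into spaces, then split on any run of whitespace.
    return text.replace(',', ' ').split()

print(split_words_no_regex("hello world, this, is John"))
# ['hello', 'world', 'this', 'is', 'John']

print(split_words_no_regex("hello world,this,   is John"))
# ['hello', 'world', 'this', 'is', 'John']
```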

1 Comment

He has to split based on special characters as well
