
Input:

Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc.

Code:

import re

input_words = list(re.split('\s+', input()))
print(input_words)

This works perfectly and returns:

['Some', 'Text', 'here:', 'Java,', 'PHP,', 'JS,', 'HTML', '5,', 'CSS,', 'Web,', 'C#,', 'SQL,', 'databases,', 'AJAX,', 'etc.']

But when I add some other separators, like this:

import re

input_words = list(re.split('\s+ , ; : . ! ( ) " \' \ / [ ] ', input()))
print(input_words)

It doesn't split on spaces anymore. Where am I wrong?

Expected output would be:

['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']

3 Answers


You should be splitting on a regex character class containing all those symbols:

input_words = re.split(r'[\s,;:.!()"\'\\\[\]]', input())
print(input_words)
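
As the comments point out, adjacent separators (e.g. the ", " between words) leave empty strings in the result; wrapping the call in list(filter(None, ...)) drops them and reproduces the expected output:

import re

text = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
# Adjacent separators produce empty strings; filter(None, ...) removes them.
input_words = list(filter(None, re.split(r'[\s,;:.!()"\'\\\[\]]', text)))
print(input_words)

This prints:

['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']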

This is a literal answer to your question. The actual solution you probably want is to split on the symbols with optional whitespace on either end (or on runs of whitespace alone), e.g.:

input = "A B ; C.D   ! E[F] G"
input_words = re.split('\s*[,;:.!()"\'\\\[\]]?\s*', input)
print(input_words)

Prints:

['A', 'B', 'C', 'D', 'E', 'F', 'G']

5 Comments

It returns empty elements; should I just trim them out afterwards?
The first solution also works if I wrap it in list(filter(None, input_words)).
Without seeing all your sample data, we can only guess, but see my updated answer for a possible solution.
Then my second suggestion is probably along the lines of what you should be doing, namely splitting on symbols with whitespace before/after them.
I've edited the question with the expected output; your first solution plus list(filter(None, input_words)) works perfectly.

Write the expression inside brackets (a character class), as shown below. Hope it helps.

import re

# The '+' belongs outside the class as a quantifier; inside the brackets it
# would be treated as a literal '+' character.
input_words = re.split(r'[\s,:.!()]+', input())
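
For instance, on the sample line from the question (note this class omits some of the question's separators, such as the semicolon, quotes, and brackets, and a trailing '.' still leaves one empty element at the end):

import re

text = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
print(re.split(r'[\s,:.!()]+', text))

Output:

['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc', '']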



Word tokenization using the nltk module:

#!/usr/bin/python3
import nltk

# nltk.download('punkt')  # the tokenizer model must be downloaded once

sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
words = nltk.tokenize.word_tokenize(sentence)
print(words)

Output:

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
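
A minimal sketch of the same idea applied to the question's input (assuming the punkt tokenizer data is installed; the exact tokenization of terms like C# may vary):

#!/usr/bin/python3
import nltk

text = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
tokens = nltk.tokenize.word_tokenize(text)
# word_tokenize returns punctuation as separate tokens; keep only tokens
# that contain at least one letter or digit.
words = [t for t in tokens if any(c.isalnum() for c in t)]
print(words)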

