
Input:

Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc.

Code:

import re

input_words = list(re.split('\s+', input()))
print(input_words)

This works perfectly and returns:

['Some', 'Text', 'here:', 'Java,', 'PHP,', 'JS,', 'HTML', '5,', 'CSS,', 'Web,', 'C#,', 'SQL,', 'databases,', 'AJAX,', 'etc.']

But when I add some other separators, like this:

import re

input_words = list(re.split('\s+ , ; : . ! ( ) " \' \ / [ ] ', input()))
print(input_words)

It doesn't split on spaces anymore. Where am I wrong?

Expected output would be:

['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']

3 Answers


You should be splitting on a regex character class containing all those symbols:

input_words = re.split(r'[\s,;:.!()"\'\\\[\]]', input())
print(input_words)
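
As the comments point out, adjacent separators (e.g. the ", " between words) leave empty strings in the result; wrapping the call in list(filter(None, ...)) drops them and reproduces the expected output:

import re

text = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
# Adjacent separators produce empty strings; filter(None, ...) removes them.
input_words = list(filter(None, re.split(r'[\s,;:.!()"\'\\\[\]]', text)))
print(input_words)

This prints:

['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']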

This is a literal answer to your question. The actual solution you probably want is to split on the symbols with optional whitespace on either end (or on runs of whitespace alone), e.g.:

input = "A B ; C.D   ! E[F] G"
input_words = re.split('\s*[,;:.!()"\'\\\[\]]?\s*', input)
print(input_words)

Prints:

['A', 'B', 'C', 'D', 'E', 'F', 'G']

5 Comments

It returns empty elements; should I just trim them out afterwards?
The first solution also works if I wrap it in list(filter(None, input_words)).
Without seeing all your sample data, we can only guess, but see my updated answer for a possible solution.
Then my second suggestion is probably along the lines of what you should be doing, namely splitting on symbols with whitespace before/after them.
I've edited the question with the expected output; your first solution plus list(filter(None, input_words)) works perfectly.

Write the expression inside brackets (a character class), as shown below. Hope it helps.

import re

# The '+' belongs outside the class as a quantifier; inside the brackets it
# would be treated as a literal '+' character.
input_words = re.split(r'[\s,:.!()]+', input())
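
For instance, on the sample line from the question (note this class omits some of the question's separators, such as the semicolon, quotes, and brackets, and a trailing '.' still leaves one empty element at the end):

import re

text = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
print(re.split(r'[\s,:.!()]+', text))

Output:

['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc', '']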



Word tokenization using the nltk module:

#!/usr/bin/python3
import nltk

# nltk.download('punkt')  # the tokenizer model must be downloaded once

sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
words = nltk.tokenize.word_tokenize(sentence)
print(words)

Output:

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
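
A minimal sketch of the same idea applied to the question's input (assuming the punkt tokenizer data is installed; the exact tokenization of terms like C# may vary):

#!/usr/bin/python3
import nltk

text = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
tokens = nltk.tokenize.word_tokenize(text)
# word_tokenize returns punctuation as separate tokens; keep only tokens
# that contain at least one letter or digit.
words = [t for t in tokens if any(c.isalnum() for c in t)]
print(words)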

