2

I am trying to break up (split) a string into words.

    import re

    def breakup(text):
        temp = []
        temp = re.split(r'\W+', text.rstrip())
        return [e.lower() for e in temp]

Example string:

What's yellow, white, green and bumpy? A pickle wearing a tuxedo

Result:

['what', 's', 'yellow', 'white', 'green', 'and', 'bumpy', 'a', 'pickle', 'wearing', 'a', 'tuxedo']

But when I pass a string like the following, the result ends with an empty string:

How is a locksmith like a typewritter? They both have a lot of keys!

['how', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'they', 'both', 'have', 'a', 'lot', 'of', 'keys', '']

I want to parse the string so that the resulting list contains no empty strings.

The strings passed in will contain punctuation and so on. Any ideas?

6 Answers

5

How about searching for what you want:

[ s.lower() for s in
  re.findall(r'\w+',
    "How is a locksmith like a typewritter? They both have a lot of keys!") ]

Or to build just one list:

[ s.group().lower() for s in
  re.finditer(r'\w+',
    "How is a locksmith like a typewritter? They both have a lot of keys!") ]
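As a minimal sketch, the same idea can be folded back into the `breakup` function from the question. The function name is reused from the question; the rest is just one way to write it, not the only one:

```python
import re

def breakup(text):
    # Match runs of word characters directly instead of splitting on the
    # gaps between them, so punctuation at either end of the string can
    # never produce an empty string in the result.
    return [word.lower() for word in re.findall(r'\w+', text)]

words = breakup("How is a locksmith like a typewritter? They both have a lot of keys!")
print(words)
```

Because `findall` only reports matches, a string made up entirely of punctuation simply gives an empty list.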

6 Comments

+1 This is simpler than splitting and then filtering.
Yeah, I found the \W strange. But maybe OP needs to split instead of search (that \W could have been just a simplified example).
findall seems indeed more appropriate than split in this case.
The intended result is a list containing tokenized or parsed text, so why is findall better than split?
I find it more readable (but that's a matter of taste, of course). Your thoughts should be focused on the words you want to process, not on the non-words between them. Have the code reflect that, and the next developer will have an easier time understanding your code.
4

Just change

return [e.lower() for e in temp]

to

return [e.lower() for e in temp if e]

Also, the line

temp = []

is not needed, since you never use the empty list you assign to temp.
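To see where the empty string comes from in the first place, here is a small sketch using a shortened version of the second example string (the variable names are only for illustration):

```python
import re

text = "They both have a lot of keys!"

# The trailing '!' matches \W+, so split() treats it as a separator and
# reports the (empty) piece that follows it.
parts = re.split(r'\W+', text)
print(parts)   # ['They', 'both', 'have', 'a', 'lot', 'of', 'keys', '']

# An empty string is falsy, so "if e" filters it out.
words = [e.lower() for e in parts if e]
print(words)   # ['they', 'both', 'have', 'a', 'lot', 'of', 'keys']
```

The same thing happens at the front of the list if the string starts with punctuation, and the `if e` filter covers that case as well.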

4 Comments

Thanks for your answer. I am wondering whether I should choose yours or JaredPar's, since he mentioned len for the empty string.
There's no need to use len, since an empty string is falsy (is this even a word?) by itself. I recommend Alfe's approach, using findall instead of split. You should accept his answer IMHO.
The intended result is a list containing tokenized or parsed text, so why is findall better than split?
I just think it is a little bit clearer, since you don't need the filtering after running the regex (unless you want to split on other things than just whitespace). OTOH, the code is quite simple, and it probably boils down to personal taste which one to use.
2

This works:

txt='''\
What's yellow, white, green and bumpy? A pickle wearing a tuxedo
How is a locksmith like a typewritter? They both have a lot of keys!'''

import re

for line in txt.splitlines():
    print([word.lower() for word in re.findall(r'\w+', line) if word.strip()])

Prints:

['what', 's', 'yellow', 'white', 'green', 'and', 'bumpy', 'a', 'pickle', 'wearing', 'a', 'tuxedo']
['how', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'they', 'both', 'have', 'a', 'lot', 'of', 'keys']


1

Why not just check for this in the list comprehension?

return [e.lower() for e in temp if len(e) > 0]

Or, for the pedantic out there:

return [e.lower() for e in temp if e]
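For strings, the two filters select exactly the same elements. A quick check, assuming an input with punctuation at both ends (the sample string is made up for illustration):

```python
import re

# Punctuation at both ends yields an empty string at both ends of the split.
parts = re.split(r'\W+', "...They both have a lot of keys!")

explicit = [e.lower() for e in parts if len(e) > 0]
implicit = [e.lower() for e in parts if e]

print(explicit == implicit)
```

Which one to write is a readability choice; the resulting lists are identical.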

8 Comments

"if len(e) > 0" is not pythonic, "if e" would be enough
@DmitryVakhrushev given the question I wonder if the newbie might benefit from understanding what is going on under the hood by being more explicit
It can be pointed out in a comment. Teaching newbies bad practices is a bad practice.
@DmitryVakhrushev I fail to see how len(e) is a bad practice. It's more explicit about what is happening here but in no way incorrect.
JaredPar, I will accept Dominic Kexel's answer. I hope you won't mind. Thanks for your answer and clarification. He also mentioned the unused empty list.
1

You could do:

'How is a locksmith <blah> a lot of keys!'.rstrip('!?.').split()
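One caveat worth checking with this approach: `rstrip` only trims characters from the end of the string, and `split()` only splits on whitespace, so punctuation in the middle of the text stays attached to its word. A short sketch using the full example string from the question:

```python
text = "How is a locksmith like a typewritter? They both have a lot of keys!"

# Strip sentence-ending punctuation from the right, then split on whitespace.
words = text.rstrip('!?.').split()
print(words)
```

Here the final `'keys'` comes out clean, but `'typewritter?'` keeps its question mark, and no lowercasing is applied.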


0

In your particular case it would be:

def breakup(text):
    temp = re.split(r'\W+', text.rstrip())
    return [e.lower() for e in temp if e]

A more general solution is:

>>> re.findall(r'\w+', 'How is a locksmith like a typewritter? They both have a lot of keys!')
['How', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'They', 'both', 'have', 'a', 'lot', 'of', 'keys']

