Parsing text with regular expression into list with empty string in result

Question

I am trying to break up (split) a string into words.

    import re

    def breakup(text):
        temp = []
        temp = re.split(r'\W+', text.rstrip())
        return [e.lower() for e in temp]

Example Strings:

What's yellow, white, green and bumpy? A pickle wearing a tuxedo

Result:

['what', 's', 'yellow', 'white', 'green', 'and', 'bumpy', 'a', 'pickle', 'wearing', 'a', 'tuxedo']

But when I pass a string like

How is a locksmith like a typewritter? They both have a lot of keys!

['how', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'they', 'both', 'have', 'a', 'lot', 'of', 'keys', '']

I want to parse the text so that the resulting list does not contain an empty string.

The strings passed in will contain punctuation, etc. Any ideas?

Alfe · Accepted Answer · 2014-01-27 16:04:12Z

5

How about searching for what you want:

[ s.lower() for s in
  re.findall(r'\w+',
    "How is a locksmith like a typewritter? They both have a lot of keys!") ]

Or to build just one list:

[ s.group().lower() for s in
  re.finditer(r'\w+',
    "How is a locksmith like a typewritter? They both have a lot of keys!") ]
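As a rough sketch, the same idea can be wrapped in the question's breakup function. re.findall returns only the matched runs of word characters, so there is nothing left over to filter, whatever punctuation the text starts or ends with. The function name comes from the question; the body is an illustration, not part of the original answer.

```python
import re

def breakup(text):
    # Collect runs of word characters directly instead of
    # splitting on the non-word runs between them.
    return [word.lower() for word in re.findall(r'\w+', text)]

print(breakup("How is a locksmith like a typewritter? They both have a lot of keys!"))
```

An empty input simply gives an empty list, since findall finds no matches.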

edited Jan 27, 2014 at 16:04

answered Jan 27, 2014 at 15:55

Alfe


6 Comments

sloth Over a year ago

+1 This is simpler than splitting and then filtering.

Alfe Over a year ago

Yeah, I found the \W strange. But maybe OP needs to split instead of search (that \W could have been just a simplified example).

njzk2 Over a year ago

findall seems indeed more appropriate than split in this case.

jamesT Over a year ago

The intended result is a list containing tokenized or parsed text, so why is findall better than split?

Alfe Over a year ago

I find it more readable (but that's a matter of taste, of course). Actually, your thoughts should be focused on the words you want to process, not the non-words between them. Have the code reflect that, and the next developer will have an easier time understanding your code.


sloth · Accepted Answer · 2014-01-27 15:54:51Z

4

Just change

return [e.lower() for e in temp]

to

return [e.lower() for e in temp if e]

Also, the line

temp = []

is not needed, since you never use the empty list you assign to temp.
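For context, here is a small sketch of where the empty string comes from. When the text ends (or begins) with something the pattern matches, re.split reports an empty string on that side of the match. The sample text is a shortened version of the question's string, chosen only for illustration.

```python
import re

text = "lot of keys!"

# The trailing '!' is matched by \W+, so split leaves an empty
# string after it.
parts = re.split(r'\W+', text)
print(parts)

# An empty string is falsy, so "if e" filters it out.
words = [e.lower() for e in parts if e]
print(words)
```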

answered Jan 27, 2014 at 15:54

sloth


4 Comments

jamesT Over a year ago

Thanks for your answer. I am wondering whether I should choose yours or JaredPar's, since he mentioned len for the empty string.

sloth Over a year ago

There's no need to use len, since an empty string is falsy (is this even a word?) by itself. I recommend Alfe's approach, using findall instead of split. You should accept his answer IMHO.

jamesT Over a year ago

The intended result is a list containing tokenized or parsed text, so why is findall better than split?

sloth Over a year ago

I just think it is a little bit clearer since you don't need the filtering after running the regex (unless you want to split on other things than just whitespace). OTOH, the code is quite simple and it probably boils down to your personal taste which one to use.

dawg · Accepted Answer · 2014-01-27 15:56:50Z

2

This works:

txt='''\
What's yellow, white, green and bumpy? A pickle wearing a tuxedo
How is a locksmith like a typewritter? They both have a lot of keys!'''

import re

for line in txt.splitlines():
    print([word.lower() for word in re.findall(r'\w+', line) if word.strip()])

Prints:

['what', 's', 'yellow', 'white', 'green', 'and', 'bumpy', 'a', 'pickle', 'wearing', 'a', 'tuxedo']
['how', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'they', 'both', 'have', 'a', 'lot', 'of', 'keys']

answered Jan 27, 2014 at 15:56

dawg


JaredPar · Accepted Answer · 2014-01-27 15:55:40Z

1

Why not just check for this in the list comprehension?

return [e.lower() for e in temp if len(e) > 0]

Or for the pedantic out there

return [e.lower() for e in temp if e]
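A quick sketch showing that, for a list of strings, the two conditions select the same elements. The sample list is made up for illustration.

```python
# Hypothetical sample: a split result with an empty string in it.
temp = ['Keys', '', 'Lot']

by_len = [e.lower() for e in temp if len(e) > 0]   # explicit length check
by_truth = [e.lower() for e in temp if e]          # relies on '' being falsy

print(by_len == by_truth)
```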

answered Jan 27, 2014 at 15:55

JaredPar


8 Comments

Dmitry Vakhrushev Over a year ago

"if len(e) > 0" is not pythonic, "if e" would be enough

JaredPar Over a year ago

@DmitryVakhrushev given the question I wonder if the newbie might benefit from understanding what is going on under the hood by being more explicit

Dmitry Vakhrushev Over a year ago

It can be pointed out in a comment. Teaching newbies bad practices is a bad practice.

JaredPar Over a year ago

@DmitryVakhrushev I fail to see how len(e) is a bad practice. It's more explicit about what is happening here, but in no way incorrect.

jamesT Over a year ago

JaredPar, I will accept Dominic Kexel's answer. I hope you won't mind. Thanks for your answer and clarification. He also mentioned the unneeded empty list, etc.


Colin Bernet · Accepted Answer · 2014-01-27 15:57:37Z

1

You could do:

'How is a locksmith <blah> a lot of keys!'.rstrip('!?.').split()
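One caveat worth checking before relying on this, shown as a sketch with the full sentence from the question: rstrip('!?.') only removes punctuation from the end of the whole string, so punctuation inside the sentence stays attached to its word, and nothing is lowercased.

```python
text = "How is a locksmith like a typewritter? They both have a lot of keys!"

# Strip trailing punctuation from the whole string, then split on whitespace.
words = text.rstrip('!?.').split()
print(words)

# The final '!' is gone, but the '?' in the middle is still attached.
print('typewritter?' in words)
```

If the input can contain punctuation in the middle of the text, a regex-based approach is probably the better fit.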

answered Jan 27, 2014 at 15:57

Colin Bernet


Dmitry Vakhrushev · Accepted Answer · 2014-01-27 15:57:50Z

0

In your particular case it would be:

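import re

def breakup(text):
    # Split on runs of non-word characters, then drop the empty strings
    # left behind when the text starts or ends with punctuation.
    temp = re.split(r'\W+', text.rstrip())
    return [e.lower() for e in temp if e]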
