The following Python regular expression splits a string into a list of words. I'm trying to avoid the empty string, '', in the output. Can I adjust the inputs to the regular expression to achieve this?

In [1]: import re
In [2]: re.split(r'[.!,;\s]\s*', 'To be or not! to be; that, is the question!')
Out[2]: ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question', '']

This results in an unwanted empty string at the end of the list returned by re.split.

  • can't you just re.split(...)[:-1]? Commented May 19 at 14:48
  • @KenzoStaelens This is not robust. For example, it fails for 'To be or not! to be; that, is the question' (no trailing punctuation, so the last element is also desired). Commented May 19 at 14:51
  • Is splitting the string what you are actually looking to do here (XY problem)? Or is splitting just a way of finding all the words in the string? If you want to find all the words in the string, you can use re.findall(r'\w+', s), where s is your string. Commented May 19 at 15:29
  • re.findall(r'\w+',s) works in this case, but it will fail to keep together a hyphenated word, like weather-beaten. Is there a modification that can be made to not split a word if there are letters (no spaces) on both sides of a hyphen? Commented May 19 at 19:10
  • @LeeMcNally Updated the answer based on the comments. Commented May 19 at 20:41

1 Answer

You can remove the undesired strings by placing the results of re.split in a list comprehension with a conditional:

import re
lst = [s for s in re.split(r'[.!,;\s]\s*', 'To be or not! to be; that, is the question!') if s != '']
print(lst)

Output:

['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
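
As a minimal sketch, the same filter can be wrapped in a small helper so that it behaves the same whether or not the input ends in punctuation (the case raised in the comments). The name split_words is hypothetical, and `if s` is just a shorter way of writing `if s != ''`:

```python
import re

def split_words(text):
    """Split on punctuation/whitespace and drop any empty strings.

    split_words is a hypothetical helper name used for illustration.
    """
    # An empty string is falsy, so `if s` discards '' wherever it appears
    # (at the start, at the end, or between adjacent delimiters).
    return [s for s in re.split(r'[.!,;\s]\s*', text) if s]

# With trailing punctuation
print(split_words('To be or not! to be; that, is the question!'))
# Without trailing punctuation: the last word is kept
print(split_words('To be or not! to be; that, is the question'))
```

Both calls should print the same ten words, since the filter only removes empty strings and never a real last element.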

Update 1

Q: re.findall(r'\w+',s) works in this case, but it will fail to keep together a hyphenated word, like weather-beaten. Is there a modification that can be made to not split a word if there are letters (no spaces) on both sides of a hyphen?

A: Use either (a) a combination of re.split and re.search, or, even better, (b) re.findall:

lst = [s for s in
       re.split(r"[.!,;\s]\s*", "To be - or not! to be; that, is the ready-made question!")
       if re.search(r"\w", s)]
print(lst)

lst = re.findall(r"\b(\w+-\w+|\w+)\b", "To be - or not! to be; that, is the ready-made question!")
print(lst)

In both cases, the output is:

['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'ready-made', 'question']
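
If words with more than one hyphen (such as mother-in-law) also need to stay together, one possible variation is to allow the hyphenated part to repeat. This is a sketch that goes slightly beyond the original question: the helper name find_words and the sample sentence are made up for illustration, and \w also matches digits and underscores, which may or may not be what you want.

```python
import re

def find_words(text):
    """Return words, keeping hyphenated compounds such as 'mother-in-law' intact.

    find_words is a hypothetical helper name used for illustration.
    """
    # \w+        one run of word characters
    # (?:-\w+)*  optionally followed by any number of '-word' parts
    # The group is non-capturing, so findall returns the whole match.
    return re.findall(r"\w+(?:-\w+)*", text)

print(find_words("My mother-in-law is weather-beaten - truly!"))
```

A hyphen surrounded by spaces is not matched, because every match has to start with a word character.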