Why do I get empty strings when using re.split() in Python?

Question

I defined a split function as lambda x: re.split('[(|)|.]', x). When I apply this function to my original strings, the result always contains some empty strings. For example:

When applied to string:

(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)

The result is:

['', 'Type', '', '', 'Terrorist organization', 'AND', 'Involved in attacks', '', '', 'nine-eleven', '']

I know I can simply remove those empty strings manually, but is there any smart way to get rid of them?

Why do you have | multiple times in the [] character set?
– Barmar, Commented Oct 4, 2019 at 19:53
Are you trying to capture characters within pairs of parentheses? Square brackets [..] mean "match any one of the characters inside", so you can't use capture groups inside them. [().] will match the literal characters present.
– N Chauhan, Commented Oct 4, 2019 at 19:54
You could filter out the empty strings afterwards with a post-processing list comprehension.
– Jean-François Fabre ♦, Commented Oct 4, 2019 at 19:55
Each individual character ((, |, ), .) inside the square brackets is considered a separate delimiter.
– chepner, Commented Oct 4, 2019 at 19:56
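
To illustrate the point made in the comments, here is a minimal sketch showing that | inside a character set is a literal character rather than alternation. The input strings below are hypothetical and chosen only to contain a literal |.

```python
import re

original = '[(|)|.]'    # pattern from the question
simplified = '[().|]'   # same set of characters, each listed once

# Hypothetical input that contains a literal '|'.
s = '(Type).(Terrorist organization)|x'

# Both patterns describe the same character set, so they split identically.
assert re.split(original, s) == re.split(simplified, s)

# With '|' in the set, a literal pipe acts as a delimiter...
pipe_split = re.split(original, 'a|b')
print(pipe_split)   # ['a', 'b']

# ...and without it, the pipe is left alone.
no_pipe = re.split('[().]', 'a|b')
print(no_pipe)      # ['a|b']
```

If the data never contains |, dropping it from the set (as in [().]) gives the same result and is easier to read.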

Jean-François Fabre · Accepted Answer · 2019-10-04 19:59:32Z

1

Grab as many separators as you can with + instead of only one:

re.split('[().]+', s)

Unfortunately, this isn't sufficient, because re.split yields an empty string at the start or end of the result whenever the input begins or ends with a separator:

['', 'Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven', '']

But you can filter them out with a post-processing step:

[x for x in re.split('[().]+', s) if x]

Alternatively, you could invert the regex and use re.findall to match as many non-separator characters as possible:

re.findall('[^().]+', s)

This directly yields:

['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']
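
As a quick check, the sketch below runs both approaches from this answer on the string from the question and confirms that they agree. The variable names are illustrative only.

```python
import re

s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'

# Approach 1: split on runs of separators, then drop the empty strings.
split_filtered = [x for x in re.split('[().]+', s) if x]

# Approach 2: match runs of non-separator characters directly.
found = re.findall('[^().]+', s)

print(found)
# ['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']

assert split_filtered == found

# For an input made only of separators, both approaches give an empty list.
assert re.findall('[^().]+', '().') == []
assert [x for x in re.split('[().]+', '().') if x] == []
```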

answered Oct 4, 2019 at 19:59

Jean-François Fabre♦


Barmar · Accepted Answer · 2019-10-04 19:59:15Z

1

The regexp matches ), ., and ( individually. Since these are next to each other in the input, there's an empty string between them, so the result contains those empty strings.

If you want to treat a sequence of delimiters as a single delimiter, add a + quantifier to the regexp so it matches them as a sequence.

re.split('[|().]+', x)

The empty string at the beginning comes from the empty string before the first (. Similarly, the empty string at the end comes from the empty string in the input after the last ). I don't think there's a simple way to prevent these; just remove them from the result.
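
The sketch below shows the leading and trailing empty strings described above. It also shows one possible workaround that is not part of this answer: stripping the delimiter characters from both ends with str.strip before splitting. Treat it as an assumption-laden alternative rather than a recommended fix.

```python
import re

s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'

# Splitting on runs of delimiters still leaves '' at both ends,
# because the string starts with '(' and ends with ')'.
grouped = re.split('[().]+', s)
print(grouped)
# ['', 'Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven', '']

# Workaround (an assumption, not from the answer): strip the delimiter
# characters from both ends first, so there is nothing to split off.
trimmed = re.split('[().]+', s.strip('().'))
print(trimmed)
# ['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']

# Caveat: an input consisting only of delimiters becomes '' after stripping,
# and splitting '' returns [''] rather than [].
only_delims = re.split('[().]+', '()'.strip('().'))
print(only_delims)  # ['']
```

Because of that caveat, filtering the result is the more robust choice when the input may be empty or contain only delimiters.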

answered Oct 4, 2019 at 19:59

Barmar


JacobIRR · Accepted Answer · 2019-10-04 21:17:35Z

1

You can filter:

filter(lambda x: x, re.split('[().]+', s))

Test:

import re
s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'
print(list(filter(None, re.split('[().]+', s))))

Result:

['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']

edited Oct 4, 2019 at 21:17

answered Oct 4, 2019 at 20:00

JacobIRR


1 Comment

Jean-François Fabre Over a year ago

filter(lambda x: x, ...) is slightly redundant. You can write filter(None, ...) instead.
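
A small sketch of the equivalence mentioned in this comment, using the string from the question. In Python 3, filter returns a lazy iterator, so it is wrapped in list() here to get a list back.

```python
import re

s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'
parts = re.split('[().]+', s)

# filter(None, ...) keeps only truthy items; for a list of strings,
# that means every non-empty string.
with_lambda = list(filter(lambda x: x, parts))
with_none = list(filter(None, parts))

assert with_lambda == with_none
print(with_none)
# ['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']
```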
