I defined a split function as lambda x: re.split('[(|)|.]', x), but when I apply it to my original strings, the result always contains some empty strings. For example:

When applied to the string:

(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)

The result is:

['', 'Type', '', '', 'Terrorist organization', 'AND', 'Involved in attacks', '', '', 'nine-eleven', '']

I know I can simply remove those empty strings manually, but is there a smarter way to get rid of them?

  • That depends on what you want. What output do you wish to get? Commented Oct 4, 2019 at 19:53
  • Why do you have | multiple times in the [] character set? Commented Oct 4, 2019 at 19:53
  • Are you trying to capture characters within pairs of parentheses? The use of square brackets [..] means match any character inside. You can’t use capture groups inside them. [().] will match the literal characters present. Commented Oct 4, 2019 at 19:54
  • You could filter out the empty strings afterwards with a post-processing list comprehension. Commented Oct 4, 2019 at 19:55
  • Each individual character ((, |, ), .) inside the square brackets is considered a separate delimiter. Commented Oct 4, 2019 at 19:56
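To illustrate the point the comments make about the character set, here is a minimal sketch. The test string 'a|b' is a made-up input, not one from the question; it only shows that | inside [...] is treated as a literal character rather than as alternation.

```python
import re

original = '[(|)|.]'   # the pattern from the question
simplified = '[().]'   # the same set without the literal '|'

# Inside [...], '|' is an ordinary character, so the original
# pattern also splits on pipes; the simplified one does not.
pipe_split_original = re.split(original, 'a|b')
pipe_split_simplified = re.split(simplified, 'a|b')

print(pipe_split_original)    # ['a', 'b']
print(pipe_split_simplified)  # ['a|b']
```

If the input never contains a pipe, both patterns behave the same, so dropping the | only makes the intent clearer.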

3 Answers

1

Grab as many separators as you can with + instead of only one:

re.split('[().]+', s)

Unfortunately, this doesn't suffice, because re.split still yields an empty string at the start and at the end whenever the string begins or ends with a separator:

['', 'Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven', '']

But you can filter them out in a post-processing step:

[x for x in re.split('[().]+', s) if x]

Alternatively, you could invert the regex and use re.findall to match as many non-separator characters as possible:

re.findall('[^().]+', s)

This directly yields:

['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']
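As a quick check that the two approaches agree, here is a small sketch that wraps each one in a function. The names split_tokens and find_tokens are hypothetical helpers introduced only for this comparison.

```python
import re

def split_tokens(s):
    # Split on runs of separators, then drop the empty strings
    # that re.split leaves at the edges.
    return [x for x in re.split('[().]+', s) if x]

def find_tokens(s):
    # Match runs of non-separator characters directly.
    return re.findall('[^().]+', s)

s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'
print(split_tokens(s))
print(find_tokens(s))
print(split_tokens(s) == find_tokens(s))  # True
```

Both functions also return an empty list for an empty string or a string made only of separators, so the findall version needs no extra filtering.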

1

The regexp matches ), ., and ( individually. Since these are next to each other in the input, there's an empty string between them, so the result contains those empty strings.
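A shortened fragment of the input makes this visible. The fragment 'Type).(Terrorist' is cut from the question's string for illustration only.

```python
import re

fragment = 'Type).(Terrorist'

# ')', '.', and '(' are three consecutive delimiters, so there are
# two empty strings between them.
one_at_a_time = re.split('[().]', fragment)
print(one_at_a_time)  # ['Type', '', '', 'Terrorist']

# With '+', the whole run ').(' counts as a single delimiter.
as_a_run = re.split('[().]+', fragment)
print(as_a_run)       # ['Type', 'Terrorist']
```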

If you want to treat a sequence of delimiters as a single delimiter, add a + quantifier to the regexp so it matches them as a sequence.

re.split('[|().]+', x)

The empty string at the beginning is there because of the empty string before the first (. Similarly, the empty string at the end comes from the empty string in the input after the last ). I don't think there's a simple way to prevent these; just remove them from the result.
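One possible way to avoid the edge cases before splitting, offered as a sketch rather than as part of the answer above, is to strip the delimiter characters from both ends with str.strip first. The DELIMS constant is a name introduced here, and the pattern assumes ( ) . are the only separators (without the | used in the pattern above).

```python
import re

s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'

DELIMS = '().'  # assumed set of separator characters

# Remove leading and trailing separators so re.split has nothing
# to split on at the edges of the string.
trimmed = s.strip(DELIMS)

# re.split('...', '') returns [''], so guard against an empty remainder.
tokens = re.split('[().]+', trimmed) if trimmed else []
print(tokens)
```

Whether this is cleaner than filtering the result afterwards is a matter of taste; it does the same work in a different order.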

1

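You can filter out the empty strings (in Python 3, filter returns an iterator, so wrap it in list() if you need a list):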

filter(lambda x: x, re.split('[().]+', s))

Test:

import re
s = '(Type).(Terrorist organization)AND(Involved in attacks).(nine-eleven)'
print(list(filter(None, re.split('[().]+', s))))

Result:

['Type', 'Terrorist organization', 'AND', 'Involved in attacks', 'nine-eleven']

1 Comment

filter(lambda x: x, ...) is slightly redundant. You can write filter(None, ...) instead.
