Python regex: splitting on pattern match that is an empty string

Question

With the re module, it seems that I am unable to split on pattern matches that are empty strings:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']

In other words, even if a match is found, if it's the empty string, even re.split cannot split the string.

The docs for re.split seem to support my results.

A "workaround" was easy enough to find for this particular case:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']

But this is an error-prone way of doing it because then I have to beware of strings that already contain the substring that I'm splitting on:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']

Is there any better way to split on an empty pattern match with the re module? Additionally, why does re.split not allow me to do this in the first place? I know it's possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript's built-in String.prototype.split().

Antti Haapala · Accepted Answer · 2017-11-24 09:59:49Z

10

Unfortunately, re.split requires a non-zero-width match. This has not been fixed yet because quite a lot of incorrect code depends on the current behaviour, for example by using [something]* as the regex. Such patterns now generate a FutureWarning, and patterns that can never split anything raise a ValueError from Python 3.5 onwards:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.

The idea is that after a certain period of warnings, the behaviour can be changed so that your regular expression would work again.
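For readers on newer interpreters: as far as I am aware, the planned change landed in Python 3.7, where re.split accepts patterns that match the empty string. The following is a minimal sketch under that assumption; the version guard is only there so the snippet does not fail on older versions.

```python
import re
import sys

pattern = r'(?<!foo)(?=bar)'
text = 'foobarbarbazbar'

# Assumption: Python 3.7 or later, where re.split splits on empty matches.
if sys.version_info >= (3, 7):
    parts = re.split(pattern, text)
else:
    # Older versions return the string unsplit or raise ValueError.
    parts = None

print(parts)  # on 3.7+: ['foobar', 'barbaz', 'bar']
```

On older versions the finditer-based functions in this answer remain the portable option.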

If you can't use the regex module, you can write your own split function using re.finditer():

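import re

def megasplit(pattern, string):
    # (start, end) span of every match, including zero-width ones
    splits = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    # each piece runs from the end of one match to the start of the next
    starts = [0] + [end for _, end in splits]
    ends = [start for start, _ in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]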

print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit(r'o', 'foobarbarbazbar'))
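As a quick sanity check, the output of megasplit can be compared with re.split on a pattern that never matches the empty string, where the two should agree. This sketch repeats the function so that it runs on its own. Note that, unlike re.split, this megasplit does not return the text of capture groups and has no maxsplit parameter.

```python
import re

def megasplit(pattern, string):
    # Same logic as the function above, repeated to keep the example self-contained.
    spans = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    starts = [0] + [end for _, end in spans]
    ends = [start for start, _ in spans] + [len(string)]
    return [string[s:e] for s, e in zip(starts, ends)]

text = 'foobarbarbazbar'

zero_width = megasplit(r'(?<!foo)(?=bar)', text)
non_zero_width = megasplit(r'o', text)

print(zero_width)      # ['foobar', 'barbaz', 'bar']
print(non_zero_width)  # ['f', '', 'barbarbazbar']
print(non_zero_width == re.split(r'o', text))  # True
```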

If you are sure that the matches are zero-width only, you can use the starts of the splits for easier code:

import re

def zerowidthsplit(pattern, string):
    splits = [m.start() for m in re.finditer(pattern, string)]
    starts = [0] + splits
    ends = splits + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
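If you are not entirely sure that every match is zero-width, one option is to make the function check that assumption instead of silently keeping the matched text inside the pieces. The helper below, strict_zerowidthsplit, is a hypothetical name introduced only for this sketch; it raises ValueError on the first non-empty match.

```python
import re

def strict_zerowidthsplit(pattern, string):
    # Hypothetical variant of zerowidthsplit that rejects non-empty matches.
    cuts = []
    for m in re.finditer(pattern, string):
        if m.start() != m.end():
            raise ValueError(
                'pattern matched non-empty text %r at position %d'
                % (m.group(), m.start())
            )
        cuts.append(m.start())
    starts = [0] + cuts
    ends = cuts + [len(string)]
    return [string[s:e] for s, e in zip(starts, ends)]

pieces = strict_zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
print(pieces)  # ['foobar', 'barbaz', 'bar']

try:
    strict_zerowidthsplit(r'o', 'foobarbarbazbar')
except ValueError as exc:
    print('rejected:', exc)
```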

edited Nov 24, 2017 at 9:59

answered May 1, 2015 at 15:13

Antti Haapala


4 Comments

Shashank Over a year ago

While the findall method in the other answer is clever, it requires the "foo" pattern to be repeated twice in the same regex. If "foo" were actually a placeholder for a much more complicated pattern, that would be entirely undesirable. This answer is the most scalable and practical for complicated regular expressions and it also doesn't require any additional modules to be installed (which also takes away the necessity to refactor existing code to work with regex), and that's why I'm accepting this as the best answer.

Antti Haapala Over a year ago

@Shashank added a split function that works correctly with zero-width and non-zero-width matches

Eric Duminil Over a year ago

How could incorrect code rely on something which isn't implemented? There are very few areas in which Python objectively sucks, and this is one fine example.

Antti Haapala Over a year ago

@EricDuminil the single example is using [something]* as separator. In any case it is being fixed.

vks · Answer · edited by Shashank 2015-05-01 18:20:05Z

4

import regex

x = "bazbarbarfoobar"
# VERSION1 behaviour allows splitting on zero-width matches
print(regex.split(r"(?<!baz)(?=bar)", x, flags=regex.VERSION1))

You can use the third-party regex module for this, as shown above.

Alternatively, use the following pattern with re.findall:

(.+?(?<!foo))(?=bar|$)|(.+?foo)$
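To make the findall approach concrete, here is a small sketch comparing the two-group pattern above with the single-group variant that Shashank proposes in the comments below. The variable names are made up for the example. Because re.findall returns one tuple per match when a pattern has more than one capture group, only the single-group variant yields a flat list of pieces.

```python
import re

text = 'foobarbarbazbar'

# Pattern from this answer: two capture groups joined by alternation.
two_groups = r'(.+?(?<!foo))(?=bar|$)|(.+?foo)$'
# Single-group variant from the comments.
one_group = r'(.+?(?<!foo)|.*?foo(?!bar))(?=bar|$)'

tuples = re.findall(two_groups, text)
pieces = re.findall(one_group, text)

print(tuples)  # [('foobar', ''), ('barbaz', ''), ('bar', '')]
print(pieces)  # ['foobar', 'barbaz', 'bar']
print(re.findall(one_group, 'foo'))  # ['foo']
```

Both patterns have to spell out "foo" twice, which is the drawback noted in the comments on the other answer.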


edited May 1, 2015 at 18:20

Shashank


answered May 1, 2015 at 14:20

vks


7 Comments

Shashank Over a year ago

You mean the module on PyPI that is supposed to replace re in the future?

Shashank Over a year ago

I had to Google it because your answer didn't have a link. :p But that is nice to know about. Any idea when the replacement is scheduled?

Shashank Over a year ago

Well foo would need to be in the capture group, so I fixed it like this: re.findall(r'(.+?(?<!foo)|.*?foo(?!bar))(?=bar|$)', 'foo'). I think that works properly. It allows the capture group to end in foo if the negative lookahead says that it's not followed by bar.

vks Over a year ago

@Shashank seems to be working fine!!! (.+?(?<!foo))(?=bar|$)|(.*?foo)$ guess both would work the same!!!

Shashank Over a year ago

Ah I see, it's giving [('', 'foo')] because findall returns tuples when you have multiple capture groups tied together with an alternation operator. Which is undesired...so I think my method is the best since it has only one capture group.
