Split a string using a list of strings as a pattern

Question

Consider an input string:

mystr = "just some stupid string to illustrate my question"

and a list of strings indicating where to split the input string:

splitters = ["some", "illustrate"]

The output should look like

result = ["just ", "some stupid string to ", "illustrate my question"]

I wrote some code which implements the following approach. For each of the strings in splitters, I find its occurrences in the input string, and insert something which I know for sure would not be a part of my input string (for example, this '!!'). Then I split the string using the substring that I just inserted.

for s in splitters:
    mystr = re.sub(r'(%s)'%s,r'!!\1', mystr)

result = re.split('!!', mystr)

This solution seems ugly, is there a nicer way of doing it?

hlt · Accepted Answer · 2014-08-20 22:28:27Z

Splitting with re.split normally removes the matched string from the output (NB: this is not quite true, see the edit below). Therefore, you must use a positive lookahead expression ((?=...)) to match without consuming the match. However, re.split ignores empty matches (this changed in Python 3.7, which does split on empty matches), so simply using a lookahead expression doesn't work on older versions. Instead, you will lose at least one character at each split (even trying to trick re with "boundary" matches (\b) does not work). If you don't care about losing one whitespace / non-word character at the end of each item (assuming you only split at non-word characters), you can use something like

re.split(r"\W(?=some|illustrate)", mystr)

which would give

["just", "some stupid string to", "illustrate my question"]

(note that the spaces after just and to are missing). You could then programmatically generate these regexes using str.join. Note that each of the split markers is escaped with re.escape so that special characters in the items of splitters do not affect the meaning of the regular expression in any undesired ways (imagine, e.g., a ) in one of the strings, which would otherwise lead to a regex syntax error).

the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
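To make this concrete, here is a minimal sketch that builds the pattern from splitters and applies it. The function names split_dropping_separator and split_on_lookahead are made up for illustration. The second function relies on the assumption that you run Python 3.7 or later, where re.split does split on empty matches, so a bare lookahead keeps every character.

```python
import re

def split_dropping_separator(text, splitters):
    # Escape each splitter so characters like ")" are matched literally.
    alternatives = "|".join(re.escape(s) for s in splitters)
    # Split at one non-word character that is followed by a splitter;
    # that single character is lost from the output.
    return re.split(r"\W(?={})".format(alternatives), text)

def split_on_lookahead(text, splitters):
    # Assumption: Python 3.7+, where re.split also splits on empty matches.
    alternatives = "|".join(re.escape(s) for s in splitters)
    return re.split(r"(?={})".format(alternatives), text)

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
print(split_dropping_separator(mystr, splitters))
# ['just', 'some stupid string to', 'illustrate my question']
print(split_on_lookahead(mystr, splitters))
# ['just ', 'some stupid string to ', 'illustrate my question']
```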

Edit (HT to @Arkadiy): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters inserted into the list as separate items. Joining every two subsequent items would then produce the list as desired as well. Then, you can also drop the requirement of having a non-word character by using (.) instead of \W:

the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]

Because normal text and auxiliary characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the auxiliary characters. Then, itertools.izip_longest is used to combine each text item with the corresponding removed character, and the last item (which has no counterpart among the removed characters) with fillvalue, i.e. ''. Each of these tuples is then joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions to these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
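For readers on Python 3, the same pairing can be sketched as follows. This is only a translation of the snippet above using itertools.zip_longest; the function name split_with_aux_char and the intermediate variable names are chosen here for illustration.

```python
import re
from itertools import zip_longest

def split_with_aux_char(text, splitters):
    alternatives = "|".join(re.escape(s) for s in splitters)
    # (.) captures the single character in front of each splitter,
    # so re.split returns it as its own list item.
    pieces = re.split(r"(.)(?={})".format(alternatives), text)
    texts = pieces[::2]       # the normal split text
    aux_chars = pieces[1::2]  # the captured characters, one fewer than texts
    # Glue each captured character back onto the text before it;
    # the last text has no partner and is paired with "".
    return ["".join(pair) for pair in zip_longest(texts, aux_chars, fillvalue="")]

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
print(split_with_aux_char(mystr, splitters))
# ['just ', 'some stupid string to ', 'illustrate my question']
```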

This leads to further simplification of the regular expression, because instead of using auxiliary characters, the lookahead can be replaced with a simple matching group ((some|illustrate) instead of (.)(?=some|illustrate)):

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]

Here, the slices of the_raw_split have swapped roles: the odd-indexed items (the matched splitters) must now be prepended to the item that follows them instead of appended to the item before them. Also note the [""] + part, which pairs the first item with "" so that the order comes out right.
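A Python 3 sketch of this capturing-group variant is shown below, with the intermediate list spelled out in a comment. The function name split_keeping_splitters is made up for illustration.

```python
import re
from itertools import zip_longest

def split_keeping_splitters(text, splitters):
    pattern = "({})".format("|".join(re.escape(s) for s in splitters))
    raw = re.split(pattern, text)
    # For the example input, raw is
    # ['just ', 'some', ' stupid string to ', 'illustrate', ' my question']
    matched = [""] + raw[1::2]  # splitters, with "" as partner for the first text
    texts = raw[::2]            # the text between splitters
    return ["".join(pair) for pair in zip_longest(matched, texts, fillvalue="")]

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
print(split_keeping_splitters(mystr, splitters))
# ['just ', 'some stupid string to ', 'illustrate my question']
```

With a single capturing group, re.split returns an odd number of items, so both lists here have the same length and a plain zip would give the same result. If the input starts with a splitter, the first item of the result is an empty string, which you may want to filter out.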

(end of edit)

Alternatively, you can (if you want) use string.replace instead of re.sub for each splitter (I think that is a matter of preference in your case, but in general it is probably more efficient)

for s in splitters:
    mystr = mystr.replace(s, "!!" + s)

Also, if you use a fixed token to indicate where to split, you do not need re.split, but can use string.split instead:

result = mystr.split("!!")
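Put together, the two snippets above could look like the sketch below. The function name split_with_marker and the explicit check that the marker does not already occur in the input are additions for illustration, not part of the original approach.

```python
def split_with_marker(text, splitters, marker="!!"):
    # The approach only works if the marker cannot occur in the input.
    if marker in text:
        raise ValueError("marker {!r} already occurs in the input".format(marker))
    for s in splitters:
        # Put the marker in front of every occurrence of the splitter.
        text = text.replace(s, marker + s)
    return text.split(marker)

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
print(split_with_marker(mystr, splitters))
# ['just ', 'some stupid string to ', 'illustrate my question']
```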

What you could also do (instead of relying on the replacement token not to be in the string anywhere else or relying on every split position being preceded by a non-word character) is finding the split strings in the input using string.find and using string slicing to extract the pieces:

def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split] # Yield everything before that position
            string = string[next_split:] # Retain the rest of the string
        else:
            yield string # Yield the rest of the string
            break # Done.

Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates the list of positions at which splitters are found, keeping only splitters that are in the string (string.find returns -1 when there is no match, so i < 0 is excluded) and that are not right at the beginning (where we (possibly) just split, so i == 0 is excluded as well). If any positions are left, we yield everything up to (excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there are none left, we yield the last part of the string and exit the function. Because this uses yield, it is a generator function, so you need to call list on it to get an actual list.
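One edge case seems worth spelling out: as far as I can tell, when a splitter sits at index 0, string.find(s) returns 0 for it, and any later occurrence of that same splitter is not considered until the string has been cut by another splitter. The sketch below is a variant (not the original function) that searches from index 1 instead. The name split_from_offset_one is made up for illustration.

```python
def split_from_offset_one(string, splitters):
    while True:
        # Start searching at index 1, so a splitter at the very front
        # does not hide later occurrences of the same splitter.
        positions = [i for i in (string.find(s, 1) for s in splitters) if i > 0]
        if positions:
            cut = min(positions)
            yield string[:cut]     # everything before the next splitter
            string = string[cut:]  # keep the rest for the next round
        else:
            yield string           # nothing left to split
            break

mystr = "just some stupid string to illustrate my question"
print(list(split_from_offset_one(mystr, ["some", "illustrate"])))
# ['just ', 'some stupid string to ', 'illustrate my question']
print(list(split_from_offset_one("some x some y", ["some"])))
# ['some x ', 'some y']
```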

Note that you could also replace yield whatever with a call to some_list.append (provided you defined some_list earlier) and return some_list at the very end. I do not consider that to be very good code style, though.

TL;DR

If you are OK with using regular expressions, use

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]

Otherwise, the same can be achieved using string.find with the following split function:

def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split] # Yield everything before that position
            string = string[next_split:] # Retain the rest of the string
        else:
            yield string # Yield the rest of the string
            break # Done.

You should escape your split words with re.escape before joining the regular expression.
I wonder if "(\W)(?=some|illustrate)" and then concatenating every two elements of the list together may work... "If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list"
@Arkadiy Yes, that would work (adding that now...). Thanks for pointing that out.
Hahaha, such a rigorous answer! Thank you! I should probably have mentioned that white spaces were not of a big importance in my particular application, but I definitely learned a lot from your answer, and now I know many ways of solving the problem ;)
@ojy I know, that escalated a little while I was working on it... Feel free to mark it as an accepted answer if it helped you

Alex Riley · Accepted Answer · 2014-08-20 21:49:17Z

Not especially elegant but avoiding regex:

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(set(indexes))

print([mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])])
# ['just ', 'some stupid string to ', 'illustrate my question']

I should acknowledge here that a little more work is needed if a word in splitters occurs more than once because str.index finds only the location of the first occurrence of the word...
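A rough sketch of that extra work, still without regex, could collect every occurrence of every splitter with str.find and a moving start offset. It also tolerates splitters that are absent from the string, because str.find returns -1 instead of raising. The function name split_at_all_occurrences is made up for illustration.

```python
def split_at_all_occurrences(text, splitters):
    # Always cut at the start and at the end of the text.
    cuts = {0, len(text)}
    for s in splitters:
        start = text.find(s)
        while start != -1:
            cuts.add(start)
            # Look for the next occurrence after the one just found.
            start = text.find(s, start + 1)
    ordered = sorted(cuts)
    return [text[i:j] for i, j in zip(ordered[:-1], ordered[1:])]

mystr = "just some stupid string to illustrate my question"
print(split_at_all_occurrences(mystr, ["some", "illustrate", "absent"]))
# ['just ', 'some stupid string to ', 'illustrate my question']
print(split_at_all_occurrences("some x some y", ["some"]))
# ['some x ', 'some y']
```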

edited Aug 20, 2014 at 21:49

answered Aug 20, 2014 at 19:45

2 Comments

ojy Over a year ago

Yea, there's also a problem when not all the words from splitters are in the string. But I like the mystr.index(), I didn't know about it, +1!

Alex Riley Over a year ago

Thanks! That's true - it will raise ValueError if not found. One could get around it by using str.find instead: that will return -1 instead if a word in splitters is not in mystr (-1 will then need to be removed from indexes, e.g. set(indexes) - set([-1]))
