Regular expression split unexpected empty string item

Question

I have strings like this one:

"BODY: 88% RECYCLED POLYESTER, 12% ELASTANE GUSSET LINING: 91% COTTON, 9% ELASTANE EXCLUSIVE OF DECORATION"

I want to split them so that each word ending in a colon starts a new list item, keeping that word in the item. For a similar string, the desired result is:

["BODY: 77% RECYCLED POLYESTER, 23% ELASTANE", "MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION"]

I came up with:

re.split("\s(\w+:.+)", p)

But this returns an empty string at the end, and I'm not sure why:

['BODY: 77% RECYCLED POLYESTER, 23% ELASTANE', 'MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION', '']

Does this answer your question? Split by regex without resulting empty strings in Python
– charles, Commented May 11, 2021 at 22:52

ggorlen · Accepted Answer · 2021-05-11 23:29:54Z

You can use re.split(r"\s(?=\w+:)", s). The lookahead (?=...) ensures the split occurs only on a whitespace character that is followed by the \w+: pattern, without consuming that pattern.

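The original attempt puts the entire pattern inside a capturing group, which leads to undesirable results. re.split includes the text of capturing groups in its output, and .+ consumes everything up to the end of the string, so the piece after the match is the empty string. If you include multiple word: groups, you'll see there are bigger problems than just the trailing empty string: only the first one is split off.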

Here's a comparison:

>>> import re
>>> s = "foo: bar bar baz: asdfa sdfasd quux: zzzz"
>>> #                ^                 ^
>>> # we want to split on the highlighted space characters above
>>>
>>> re.split(r"\s(\w+:.+)", s) # incorrect
['foo: bar bar', 'baz: asdfa sdfasd quux: zzzz', '']
>>> re.split(r"\s(?=\w+:)", s) # correct
['foo: bar bar', 'baz: asdfa sdfasd', 'quux: zzzz']
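To see where the trailing empty string in the incorrect version comes from, it may help to rebuild the result by hand. The following is a small sketch (the variable names are only illustrative) that uses re.search to locate the single match and then assembles the same three pieces re.split produces: the text before the match, the captured group, and the text after the match.

```python
import re

s = "foo: bar bar baz: asdfa sdfasd quux: zzzz"
pattern = r"\s(\w+:.+)"

# The pattern matches only once, because .+ runs to the end of the string.
m = re.search(pattern, s)

prefix = s[:m.start()]   # text before the separator match
captured = m.group(1)    # re.split inserts captured group text into its result
suffix = s[m.end():]     # text after the match: empty, the match ended at len(s)

pieces = [prefix, captured, suffix]
print(pieces)                           # ['foo: bar bar', 'baz: asdfa sdfasd quux: zzzz', '']
print(pieces == re.split(pattern, s))   # True
```

With the lookahead version nothing is captured and the match consumes only the whitespace, so no piece is swallowed and no empty remainder is left over.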

If you want to handle splitting on multiple spaces, you can use r"\s+(?=\w+:)".
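As a rough sketch of how the multi-space variant might be wrapped up, here is a hypothetical helper (split_on_labels and LABEL_BOUNDARY are names made up for this example) applied to a string like the one in the question, with extra spaces added before the second label:

```python
import re

# Precompiled separator: one or more whitespace characters that are
# immediately followed by a word and a colon. The lookahead is not consumed.
LABEL_BOUNDARY = re.compile(r"\s+(?=\w+:)")

def split_on_labels(text):
    """Split text so that each 'WORD:' label starts a new item."""
    return LABEL_BOUNDARY.split(text)

# Three spaces before MESH: to exercise the \s+ part.
p = ("BODY: 77% RECYCLED POLYESTER, 23% ELASTANE   "
     "MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION")

parts = split_on_labels(p)
print(parts)
# ['BODY: 77% RECYCLED POLYESTER, 23% ELASTANE',
#  'MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION']
```

One caveat: \w+ does not span spaces, so a label made of several words, such as "GUSSET LINING:", would be split just before "LINING:" and leave "GUSSET" at the end of the previous item. Handling that would need a different pattern and some assumption about what a label can look like.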

Note also that raw strings should be used for all regex literals, so that backslashes reach the regex engine unchanged instead of being interpreted as string escapes.
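A quick illustration of why this matters, using \b as the example (the sample text is arbitrary): in a normal string literal, "\b" is a backspace character, so the regex engine never sees a word-boundary assertion. The "\s" in the question happens to survive only because it is not a recognised string escape, and recent Python versions emit a warning for such sequences.

```python
import re

text = "a foo b"

plain = "\bfoo\b"   # normal literal: \b becomes a backspace character (U+0008)
raw = r"\bfoo\b"    # raw literal: the backslashes are passed through to re

print(len(plain), len(raw))           # 5 7
print(re.search(plain, text))         # None: there is no backspace in the text
print(re.search(raw, text).group())   # foo
```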
