Splitting longer patterns using regex without losing characters Python 3+

Question

My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:

re.split(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)

I need to split the text at the whitespace inside each match. But re.split, as it should, removes the whole match rather than just the whitespace, so the characters around the whitespace are lost, including the last character of the sentence and its terminator.

"Is this the number 3? The text goes on..."

will look like

"Is this the number " and "The text goes on..."

Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?

Have you considered using lookarounds to find the space on which you actually want to split?
– jonrsharpe, Commented May 18, 2015 at 13:47
@jonrsharpe: that only works if there is at least one character you capture.
– willeM_ Van Onsem, Commented May 18, 2015 at 13:51
Shameless self promotion but: stackoverflow.com/questions/29988595/… Using the accepted solution there with lookarounds, you can probably succeed.
– Shashank, Commented May 18, 2015 at 13:52
Oh and alternatively, if you don't want to use the accepted solution there, come up with an alternative regex with lookarounds that matches a full sentence, and use re.findall :) Then you will lose no characters in the "split".
– Shashank, Commented May 18, 2015 at 13:57
@CommuSoft true, I assumed the OP would capture the whitespace they refer to.
– jonrsharpe, Commented May 18, 2015 at 14:01

Community · Accepted Answer · 2017-05-23 10:26:56Z

As @jonrsharpe says, you can use lookarounds to reduce the number of characters consumed by the split, for instance to a single one. If you don't mind losing the space characters, you could use something like:

>>> re.split(r'\s(?=[A-Z])', content)
['Is this the number 3?', 'The text goes on...']

This splits on any whitespace character that is followed by an uppercase letter. The lookahead only checks for the T without consuming it, so only the space is removed.
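
If you want to keep your original patterns, one option is to move everything except the whitespace into lookarounds, so that only the whitespace is consumed. The following is a minimal sketch rather than a tested sentence splitter: the name WHITESPACE_AT_BOUNDARY is made up for illustration, and the `$` alternative is dropped on the assumption that it was only there to match the end of the text.

```python
import re

content = "Is this the number 3? The text goes on..."

# Same three cases as the original pattern, but the characters around the
# whitespace are only checked by lookbehind/lookahead, never consumed.
# Python requires fixed-width lookbehinds; each one here is two characters.
WHITESPACE_AT_BOUNDARY = re.compile(
    r'(?<=\D[.!?])\s(?=[A-Z])'   # non-digit + terminator, then uppercase
    r'|(?<=\d[!?])\s'            # digit + ! or ?
    r'|(?<=\d[.])\s(?=[A-Z])'    # digit + period, then uppercase
)

sentences = WHITESPACE_AT_BOUNDARY.split(content)
print(sentences)  # ['Is this the number 3?', 'The text goes on...']
```

Because every match is exactly one whitespace character, this works with plain re.split and does not depend on how empty matches are handled.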

Alternative approach: alternating split/capture item

You can, however, use another approach. Splitting consumes the matched text, but you can use the same regex to generate a list of the matches. These matches are exactly the pieces that sat between the split items. By interleaving the matches with the split items, you reconstruct the full list:

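import re

def nonconsumesplit(regex, content):
    # Text between the matches (the matches themselves are consumed).
    outer = re.split(regex, content)
    # The consumed matches, padded with '' so the last piece has a partner.
    # Assumes regex has no capturing groups.
    inner = re.findall(regex, content) + ['']
    # Interleave: piece, match, piece, match, ...
    return [val for pair in zip(outer, inner) for val in pair]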

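Which results in (output from a Python version before 3.7; from 3.7 on, re.split also splits on the empty match of $, so the first call returns two extra trailing empty strings):

>>> nonconsumesplit(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit(r'\s', content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']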
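
For what it's worth, re.split can produce the same alternating list by itself: when the pattern is wrapped in a capturing group, the matched text is returned between the split items. A small sketch, assuming the pattern contains no other capturing groups and leaving out the `$` alternative (an assumption on my part, since it only ever matches the empty string at the end):

```python
import re

content = "Is this the number 3? The text goes on..."
pattern = r'\D[.!?]\s[A-Z]|\d[!?]\s|\d[.]\s[A-Z]'

# Wrapping the whole pattern in one capturing group makes re.split
# return the consumed text as separate items between the pieces.
parts = re.split('(' + pattern + ')', content)
print(parts)  # ['Is this the number ', '3? ', 'The text goes on...']
```

Joining the items back together gives the original text, so nothing is lost.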

Or you can concatenate each split item with the match that followed it:

def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    # Glue each piece to the match that followed it.
    return [piece + match for piece, match in zip(outer, inner)]

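Which results in (again on a Python version before 3.7; from 3.7 on, the first call returns one extra trailing empty string because of the empty match of $):

>>> nonconsumesplitconcat(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat(r'\s', content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']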
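
The length of the re.split result depends on how the Python version treats empty matches (such as the one from `$`), so pairing it with re.findall can get out of step. A sketch of the same idea built on re.finditer, which slices the text at the end of every non-empty match, sidesteps that. Note that split_keep_delimiters is a made-up name for this example, not a library function:

```python
import re

def split_keep_delimiters(regex, content):
    """Split content after each non-empty match, keeping the matched text
    attached to the end of the piece that precedes it."""
    pieces = []
    start = 0
    for match in re.finditer(regex, content):
        if match.start() == match.end():
            continue  # ignore empty matches, e.g. from `$`
        pieces.append(content[start:match.end()])
        start = match.end()
    if start < len(content):
        pieces.append(content[start:])  # text after the last match
    return pieces

content = "Is this the number 3? The text goes on..."
result = split_keep_delimiters(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
print(result)  # ['Is this the number 3? ', 'The text goes on...']
```

Since it never calls re.split, it also works when the pattern contains capturing groups.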

edited May 23, 2017 at 10:26

CommunityBot

answered May 18, 2015 at 14:12

willeM_ Van Onsem

2 Comments

Andris Leduskrasts Over a year ago

Thanks! The uppercase solution is not really usable, as there's a statistically high chance of capitalized words (names, towns, whatever) and only a low chance of one being combined with a number. Reconstructing the list is what I was looking for.

willeM_ Van Onsem Over a year ago

@andrisleduskrasts: well, it was only an example of course. But as you have probably noted during the discussion, a lookaround regex that consumes no characters will not work. So to make the solution generic enough, one needs to reconstruct the consumed substrings oneself.
