My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:

re.split(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)

I need to split the sentence at the whitespace when the pattern occurs. But the code, as it should, will split the text at the point the pattern occurs and not at the whitespace. It will not save the last character of the sentence including the sentence terminator.

"Is this the number 3? The text goes on..."

will look like

"Is this the number " and "he text goes on..."

Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?

  • Have you considered using lookarounds to find the space on which you actually want to split? Commented May 18, 2015 at 13:47
  • @jonrsharpe: that only works if there is at least one character you capture. Commented May 18, 2015 at 13:51
  • Shameless self promotion but: stackoverflow.com/questions/29988595/… Using the accepted solution there with lookarounds, you can probably succeed. Commented May 18, 2015 at 13:52
  • Oh and alternatively, if you don't want to use the accepted solution there, come up with an alternative regex with lookarounds that matches a full sentence, and use re.findall :) Then you will lose no characters in the "split". Commented May 18, 2015 at 13:57
  • @CommuSoft true, I assumed the OP would capture the whitespace they refer to. Commented May 18, 2015 at 14:01

1 Answer

As @jonrsharpe says, you can use a lookaround to reduce the number of characters consumed by the split, for instance to a single one. If you don't mind losing the space characters, you could use something like:

>>> re.split(r'\s(?=[A-Z])', content)
['Is this the number 3?', 'The text goes on...']

This splits on any whitespace character that is followed by an uppercase letter. Because the uppercase letter sits inside a lookahead, the T is not consumed, only the space.
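
If relying on an uppercase letter is too fragile, a lookbehind on the sentence terminator is another option. This is only a minimal sketch and not part of the approach above: it assumes every sentence ends in `.`, `!` or `?` followed by whitespace, so an abbreviation such as "Dr. Smith" would still be split.

```python
import re

content = "Is this the number 3? The text goes on..."

# Split on whitespace that is preceded by a sentence terminator.
# The lookbehind (?<=...) checks the terminator without consuming it,
# so only the whitespace is removed from the result.
sentences = re.split(r'(?<=[.!?])\s+', content)
print(sentences)
# ['Is this the number 3?', 'The text goes on...']
```

The terminator stays attached to its sentence because only the whitespace is part of the match.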

Alternative approach: alternating split/capture item

However, you can use another approach. Splitting consumes the matched text, but you can run the same regex through re.findall to get a list of exactly those matches, i.e. the text that sat between the split pieces. By interleaving the matches with the split pieces, you reconstruct the full list:

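import re

def nonconsumesplit(regex, content):
    # Pieces between the matches (the matched text itself is consumed).
    outer = re.split(regex, content)
    # The consumed matches, padded so the last piece has a partner.
    inner = re.findall(regex, content) + ['']
    # Interleave: piece, match, piece, match, ...
    return [val for pair in zip(outer, inner) for val in pair]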

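Which results in the following (output as produced on Python versions before 3.7; from 3.7 on, re.split also splits on the empty `$` match, which adds extra empty strings at the end of the first result):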

>>> nonconsumesplit(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> nonconsumesplit(r'\s', content)
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']

Or you can concatenate each piece with the match that follows it:

def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)
    inner = re.findall(regex, content) + ['']
    # Glue each consumed match back onto the piece that precedes it.
    return [piece + match for piece, match in zip(outer, inner)]

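Which results in the following (again as produced before Python 3.7; on 3.7 and later the first call returns an additional trailing empty string because of the empty `$` match):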

>>> nonconsumesplitconcat(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat(r'\s', content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
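
The same reconstruction can probably be done in a single pass with re.finditer, slicing the original string at the end of each match. This is a sketch under one assumption of mine: empty matches (such as the `$` alternative) should simply be ignored. The name split_keep_delimiters is made up for illustration. Since it never calls re.split, the result does not depend on how a given Python version treats empty matches.

```python
import re

def split_keep_delimiters(pattern, text):
    """Cut text after every non-empty match, keeping the matched text."""
    pieces = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.end() == match.start():
            continue  # assumption: empty matches (e.g. `$`) are skipped
        # Keep everything up to and including the matched delimiter.
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])  # remainder after the last match
    return pieces

content = "Is this the number 3? The text goes on..."
pattern = r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]'
print(split_keep_delimiters(pattern, content))
# ['Is this the number 3? ', 'The text goes on...']
```

As with the concatenating version, whatever the pattern matches stays with the preceding piece, so an alternative that matches the first letter of the next sentence will leave that letter at the end of the previous one.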

2 Comments

Thanks! The uppercase solution is not really usable as there's a statistically high chance of capital words (names, towns, be it whatever), with low chance of it being combined with a number. Reconstructing the list is what I was looking for.
@andrisleduskrasts: well, it was only an example, of course. But as you have probably noted during the discussion, a lookaround regex that consumes no character will not work. So to make the solution generic enough, one needs to reconstruct the consumed substrings oneself.
