I'd like to find a regular expression that can break up paragraphs (long strings, no newline characters to worry about) into sentences, using the simple rule that any of {., ?, !} followed by a whitespace character and then a capital letter marks the end of a sentence (I realize this is not a good rule for real life).
I've got something partly working, but it doesn't quite do the job:
import re
line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
matchObj = re.split(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
print(matchObj)
prints
['', 'a b c FFF!', '', ' a b a a FFF. gegtat FFF.', '']
whereas I'd like to get:
['a b c FFF!', 'D a b a a FFF. gegtat FFF.']
So two questions.
1. Why are there empty members ('') in the results?
2. I understand why the D gets cut out of the split result - it's part of the first match. How can I structure my search differently so that the capital letter coming after the punctuation is put back and included with the next sentence? In this case, how can I get D to turn up in the second element of the split result?
I know I could accomplish this with some sort of for-loop just peeling off the first result, adding back the capital letter and then doing it all over again, but this seems not-so-Pythonic. If regex is not the way to go here, is there something that still avoids the for loop?
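For concreteness, here is a rough sketch of the loop-based workaround I have in mind. The helper name split_with_loop is just something I made up for illustration, and it applies the general rule from the top of the question rather than the FFF-specific pattern:

```python
import re

def split_with_loop(text):
    """Peel off one sentence per pass, keeping the capital letter for the next one."""
    # Rule from the question: one of . ? ! then a whitespace, then a capital letter.
    boundary = re.compile(r'([.?!])\s([A-Z])')
    sentences = []
    while True:
        match = boundary.search(text)
        if match is None:
            break
        # Keep everything up to and including the punctuation mark.
        sentences.append(text[:match.end(1)])
        # Resume at the capital letter so it stays with the next sentence.
        text = text[match.start(2):]
    if text:
        # Whatever is left after the last boundary is the final sentence.
        sentences.append(text)
    return sentences

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
print(split_with_loop(line))
# ['a b c FFF!', 'D a b a a FFF. gegtat FFF.', 'A']
```

Under the stated rule the trailing A comes out as its own element. This works, but it is exactly the kind of manual peeling I was hoping to avoid.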
Thanks for any suggestions.