Python regex split but put end part of regex match back into string?

Question

I'd like to find a regular expression that can break up paragraphs (long strings, no newline characters to worry about) into sentences with the simple rule that any of {., ?, !} followed by a whitespace character and then a capital letter should be the end of the sentence (I realize this is not a good rule for real life).

I've got something partly working, but it doesn't quite do the job:

import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
matchObj = re.split(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
print(matchObj)

prints

['', 'a b c FFF!', '', ' a b a a FFF. gegtat FFF.', '']

whereas I'd like to get:

['a b c FFF!', 'D a b a a FFF. gegtat FFF.']

So, two questions:

1. Why are there empty members ('') in the results?
2. I understand why the D gets cut out from the split result - it's part of the first search. How can I structure my search differently so that the capital letter coming after the punctuation is put back so it can be included with the next sentence? In this case, how can I get D to turn up in the second element of the split result?

I know I could accomplish this with some sort of for-loop just peeling off the first result, adding back the capital letter and then doing it all over again, but this seems not-so-Pythonic. If regex is not the way to go here, is there something that still avoids the for loop?

Thanks for any suggestions.

Andrea Corbellini · Accepted Answer · 2015-05-09 08:16:36Z

To solve the first problem (empty strings in the result returned by split()), use either findall() or finditer():
```
>>> re.findall(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
['a b c FFF!', ' a b a a FFF. gegtat FFF.']
```
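You were seeing empty strings in the output because that's what split() is supposed to do: it splits the input string using the matches as delimiters and, when the pattern contains a capturing group, also puts the text of each group into the result. Here the two matches cover the entire string, so the pieces before, between and after them are all empty.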
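To see where the empty strings come from, here is a small sketch using the pattern from the question (the names `pattern`, `spans` and `parts` are only for illustration). It prints the span of each match and then the result of split():

```python
import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
pattern = r'(.*?\sFFF[\.|\?|\!])\s[A-Z]'

# Where does each match start and end? The matches sit back to back
# and together cover the whole string.
spans = [m.span() for m in re.finditer(pattern, line)]
print(spans)

# split() returns: text before match 1, group of match 1,
# text between the matches, group of match 2, text after match 2.
parts = re.split(pattern, line)
print(parts)
```

Because nothing is left over outside the matches, every "in between" piece is '', and only the captured groups carry text.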
For the second problem (the missing D from the output), use a lookahead assertion (?=...):
```
>>> re.findall(r'(.*?\sFFF[\.|\?|\!])\s(?=[A-Z])', line)
['a b c FFF!', 'D a b a a FFF. gegtat FFF.']
```
Lookaheads, negative lookaheads, lookbehinds and negative lookbehinds are four kinds of assertions that let you say "match here only if this is (or is not) followed/preceded by some pattern, but don't consume that part of the string".
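As an illustration of combining a lookbehind with a lookahead, the following sketch splits on just the whitespace that sits between a sentence-ending character and a capital letter. This is an alternative to the findall() approach above, not the same technique, and it behaves slightly differently: the trailing 'A' is kept as a final piece, which the output requested in the question does not include.

```python
import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'

# (?<=[.?!]) : the whitespace must be preceded by ., ? or ! (not consumed)
# (?=[A-Z])  : and followed by a capital letter (not consumed)
# Only the single whitespace character is consumed as the delimiter.
sentences = re.split(r'(?<=[.?!])\s(?=[A-Z])', line)
print(sentences)
# ['a b c FFF!', 'D a b a a FFF. gegtat FFF.', 'A']
```

Since the pattern has no capturing group and the delimiter is only the whitespace, nothing needs to be "put back" and no empty strings appear for this input.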
Reading your expression carefully, it seems you have misunderstood the syntax of the [...] construct (a character class). It looks like you want to match one of ., ? and !.

If that is the case, then you can rewrite [\.|\?|\!] as [.?!]:
```
>>> re.findall(r'(.*?\sFFF[.?!])\s(?=[A-Z])', line)
['a b c FFF!', 'D a b a a FFF. gegtat FFF.']
```
[.?!] is not only more compact, but also more correct: inside a character class | is a literal character, not alternation, so with [\.|\?|\!] you were also matching the | character (meaning 'a b c FFF|' would be a valid match)!
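A quick way to check the difference between the two character classes (the names `wrong` and `right` are just for this sketch):

```python
import re

wrong = re.compile(r'FFF[\.|\?|\!]')  # the class contains ., |, ? and !
right = re.compile(r'FFF[.?!]')       # the class contains only ., ? and !

print(bool(wrong.search('a b c FFF|')))  # True: | is accepted by mistake
print(bool(right.search('a b c FFF|')))  # False
print(bool(right.search('a b c FFF?')))  # True
```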
