I'd like to find a regular expression that can break up paragraphs (long strings, no newline characters to worry about) into sentences, using the simple rule that any of {., ?, !} followed by whitespace and then a capital letter marks the end of a sentence (I realize this is not a good rule for real life).

I've got something partly working, but it doesn't quite do the job:

import re

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
matchObj = re.split(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
print(matchObj)

prints

['', 'a b c FFF!', '', ' a b a a FFF. gegtat FFF.', '']

whereas I'd like to get:

['a b c FFF!', 'D a b a a FFF. gegtat FFF.']

So two questions.

  • Why are there empty members ('') in the results?

  • I understand why the D gets cut out from the split result - it's part of the first search. How can I structure my search differently so that the capital letter coming after the punctuation is put back so it can be included with the next sentence? In this case, how can I get D to turn up in the second element of the split result?

I know I could accomplish this with some sort of for-loop just peeling off the first result, adding back the capital letter and then doing it all over again, but this seems not-so-Pythonic. If regex is not the way to go here, is there something that still avoids the for loop?

Thanks for any suggestions.

1 Answer

  1. To solve the first problem (empty strings in the result returned by split()), use either findall() or finditer():

    >>> re.findall(r'(.*?\sFFF[\.|\?|\!])\s[A-Z]', line)
    ['a b c FFF!', ' a b a a FFF. gegtat FFF.']
    

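    You were seeing empty strings in the output because of how split() works: each whole match is treated as a delimiter, the text captured by a group is inserted into the result, and the pieces of the string between delimiters are returned as well. Here the first match starts at the beginning of the string, the second follows it immediately and runs to the end of the string, so the pieces before, between and after the matches are all empty.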

  2. For the second problem (the missing D from the output), use a lookahead assertion (?=...):

    >>> re.findall(r'(.*?\sFFF[\.|\?|\!])\s(?=[A-Z])', line)
    ['a b c FFF!', 'D a b a a FFF. gegtat FFF.']
    

    Lookaheads (?=...), negative lookaheads (?!...), lookbehinds (?<=...) and negative lookbehinds (?<!...) are four kinds of assertions that let you say "match here only if this pattern does (or does not) follow/precede" without consuming any of the string.

  3. Reading your expression carefully, it seems you have misunderstood the syntax of character classes ([...]). It looks like you want to match any one of ., ? and !.

    If that is the case, then you can rewrite [\.|\?|\!] as [.?!]:

    >>> re.findall(r'(.*?\sFFF[.?!])\s(?=[A-Z])', line)
    ['a b c FFF!', 'D a b a a FFF. gegtat FFF.']
    

    [.?!] is not only more compact but also more correct: inside a character class | is a literal character, so with [\.|\?|\!] you were also matching | (meaning 'a b c FFF|' would have been accepted as the end of a sentence)!
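If you would rather keep using split(), the same lookaround idea can be applied to the delimiter itself. The following is only a sketch under the rule stated in the question (any of ., ? or ! followed by whitespace and a capital letter); split_sentences is a hypothetical helper name, and the pattern drops the FFF marker used in the examples above.

```python
import re

def split_sentences(text):
    """Hypothetical helper: split after ., ? or ! when a capital letter follows."""
    # (?<=[.?!]) and (?=[A-Z]) are zero-width assertions, so only the
    # whitespace between them is consumed as the delimiter. There is no
    # capturing group, so split() adds nothing extra to the result.
    return re.split(r'(?<=[.?!])\s+(?=[A-Z])', text)

line = 'a b c FFF! D a b a a FFF. gegtat FFF. A'
print(split_sentences(line))
# ['a b c FFF!', 'D a b a a FFF. gegtat FFF.', 'A']
```

Unlike the findall() versions, this keeps the trailing 'A', because split() returns whatever follows the last delimiter.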
