I'm trying to extract the patterns below from text using a regex:

John Doe
JOHN DOE
Sam John Watson
Sam John Lilly Watson
SAM JOHN WATSON
SAM JOHN LILLY WATSON

The input data contains only a single line, and I need to find the patterns above in it.

More about the pattern:

  • Each word starts with an uppercase letter, followed by uppercase or lowercase letters
  • Minimum 2 words
  • Maximum 4 words
  • Words will include only A-Z or a-z chars

What I Tried:

import re
re.findall("[A-Z][A-Za-z]+ [A-Z][A-Za-z]+ [A-Za-z]* [A-Za-z]*", text)

This correctly matches input like:

Sam Peters John Doe
SAM WINCH DAN BROWN

but it fails on input with fewer than 4 words.

If this is for a real system rather than a programming exercise, it is probably worth reading "Falsehoods Programmers Believe About Names". Commented Dec 27, 2018 at 9:57

1 Answer

Your pattern fails because, even with the * quantifiers on the last two character sets, the spaces in front of those character sets are not optional. For example, a string with only 2 words matches only if those two words are followed by two spaces.
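A quick check in Python illustrates this. The sample strings are made up for illustration; the pattern is the one from the question:

```python
import re

# Pattern from the question: the two spaces before the optional
# third and fourth words are mandatory.
pattern = r"[A-Z][A-Za-z]+ [A-Z][A-Za-z]+ [A-Za-z]* [A-Za-z]*"

four_words = re.findall(pattern, "Sam Peters John Doe")
two_words = re.findall(pattern, "John Doe")
two_words_padded = re.findall(pattern, "John Doe  ")  # two trailing spaces

print(four_words)        # the full four-word string is found
print(two_words)         # nothing is found
print(two_words_padded)  # found, but only thanks to the trailing spaces
```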

I'd suggest that you start with [A-Z][A-Za-z]+ for the first word, then repeat a space followed by a word between 1 and 3 times:

^[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+){1,3}$

https://regex101.com/r/IvSvAH/1
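As a minimal sketch of how this pattern could be used from Python, assuming the whole input line should be a name (which is what the ^ and $ anchors require). The names NAME_RE and is_name are hypothetical, chosen only for this example:

```python
import re

# Pattern from the answer: one word, then 1 to 3 repetitions of " word".
NAME_RE = re.compile(r"^[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+){1,3}$")

def is_name(line):
    """Return True if the whole line consists of 2 to 4 capitalized words."""
    return NAME_RE.match(line) is not None

samples = [
    "John Doe",
    "SAM JOHN WATSON",
    "Sam John Lilly Watson",
    "John",                       # too few words
    "Sam John Lilly Watson Doe",  # too many words
    "john doe",                   # words must start with an uppercase letter
]

for line in samples:
    print(line, "->", is_name(line))
```

Because of the anchors, this checks a whole line rather than searching inside a longer text.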

If there may be words with only one character (like "I" or "A"), then repeat the [A-Za-z] character sets with * instead of +.
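A sketch of that variant, under the same assumption that the whole line should match; SHORT_NAME_RE is a hypothetical name used only here:

```python
import re

# Same structure as the suggested pattern, but each word is one uppercase
# letter followed by zero or more letters, so single-letter words are allowed.
SHORT_NAME_RE = re.compile(r"^[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*){1,3}$")

one_letter_words = SHORT_NAME_RE.match("A B") is not None
mixed_lengths = SHORT_NAME_RE.match("John A Doe") is not None
single_word = SHORT_NAME_RE.match("John") is not None

print(one_letter_words, mixed_lengths, single_word)
```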

2 Comments

Works like a charm. It would be more helpful if you pointed out why my solution failed in this case.
@Sociopath I might also suggest a new username. But +1 to this answer.
