I want to match possible names from a string. A name should be 2-4 words, each with 3 or more letters, all words capitalized. For example, given this list of strings:

Her name is Emily.
I work for Surya Soft.
I sent an email for Ery Wulandari.
Welcome to the Link Building Partner program!

I want a regex that returns:

None
Surya Soft
Ery Wulandari
Link Building Partner

Here is my current code:

import re

data = [
    'Her name is Emily.',
    'I work for Surya Soft.',
    'I sent an email for Ery Wulandari.',
    'Welcome to the Link Building Partner program!',
]

for line in data:
    # Matches exactly two capitalized words separated by whitespace
    print(re.findall(r'(?:[A-Z][a-z0-9]{2,}\s+[A-Z][a-z0-9]{2,})', line))

It works for the first three lines, but it fails on the last line: it returns only 'Link Building' instead of 'Link Building Partner'.

  • Don't you basically want any situation where there's a character sequence beginning with a capital letter? Commented Jun 6, 2013 at 4:13
  • @mikebabcock yes but it should be 2-4 words Commented Jun 6, 2013 at 4:14

4 Answers

You can wrap the repeating word pattern in a group and apply a quantifier to it:

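import re

# Two or more capitalized words; \s* also consumes whitespace after the last word
compiled = re.compile(r'(?:(([A-Z][a-z0-9]{2,})\s*){2,})')
for line in data:  # data is the list from the question
    match = compiled.search(line)
    if match:
        print(match.group())
    else:
        print(None)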

Output:

None
Surya Soft
Ery Wulandari
Link Building Partner 
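
If the trailing space in the last result or the missing upper bound of four words matters, one possible variant is sketched below. It is not part of the answer above: the names NAME_RE and first_name are made up for illustration, and the pattern assumes that whitespace should only be allowed between words and that a match should stop after four words.

```python
import re

# Assumed variant: one capitalized word, then 1 to 3 more, each preceded
# by whitespace. No whitespace is consumed after the last word.
NAME_RE = re.compile(r'[A-Z][a-z0-9]{2,}(?:\s+[A-Z][a-z0-9]{2,}){1,3}')

def first_name(line):
    """Return the first run of 2-4 capitalized words in line, or None."""
    match = NAME_RE.search(line)
    return match.group() if match else None

data = [
    'Her name is Emily.',
    'I work for Surya Soft.',
    'I sent an email for Ery Wulandari.',
    'Welcome to the Link Building Partner program!',
]

for line in data:
    print(first_name(line))
```

Because the separator is \s+ rather than \s*, two capitalized chunks glued together without a space are no longer counted as two words.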

1 Comment

Yes this seems the better one; output is as expected

You can use:

re.findall(r'((?:[A-Z]\w{2,}\s*){2,4})', line)

Each match may include trailing whitespace, which can be trimmed with .strip().
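
A minimal sketch of how that might be applied to every match in a line. The helper name find_names is hypothetical, and the sample line is the extended one with two candidate names.

```python
import re

# Same pattern as above, compiled once
pattern = re.compile(r'((?:[A-Z]\w{2,}\s*){2,4})')

def find_names(line):
    """Return all 2-4 word capitalized runs in line, without trailing whitespace."""
    return [m.strip() for m in pattern.findall(line)]

line = 'Welcome to the Link Building Partner abc Fooo Foo program!'
print(find_names(line))
```

An empty list means no candidate name was found, so the caller can decide whether to print None.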

Non-regex solution:

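from string import punctuation as punc

def solve(strs):
    # Each group is a run of consecutive candidate words stored as (index, word)
    words = [[]]
    for i, x in enumerate(strs.split()):
        x = x.strip(punc)
        # Check the length first so an empty token (e.g. a lone "-") cannot raise IndexError
        if len(x) > 2 and x[0].isupper():
            if words[-1] and words[-1][-1][0] == i - 1:
                words[-1].append((i, x))
            else:
                words.append([(i, x)])

    # Keep only runs of 2 to 4 words
    names = [" ".join(y[1] for y in x) for x in words if 2 <= len(x) <= 4]
    return ", ".join(names) if names else None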


data = [
   'Her name is Emily.', 
   'I work for Surya Soft.', 
   'I sent an email for Ery Wulandari.', 
   'Welcome to the Link Building Partner abc Fooo Foo program!'
]
for x in data:
    print(solve(x))

Output:

None
Surya Soft
Ery Wulandari
Link Building Partner, Fooo Foo
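
The same grouping of consecutive candidate words could also be expressed with itertools.groupby. This is a swapped-in technique, not the code above: the names is_candidate and find_name_runs are hypothetical, and the sketch returns a list instead of a comma-joined string.

```python
from itertools import groupby
from string import punctuation

def is_candidate(word):
    """A candidate word has at least 3 characters and starts with a capital."""
    return len(word) > 2 and word[0].isupper()

def find_name_runs(text):
    # Strip surrounding punctuation from each whitespace-separated token
    tokens = [t.strip(punctuation) for t in text.split()]
    names = []
    # groupby yields maximal runs of consecutive tokens with the same flag
    for flag, run in groupby(tokens, key=is_candidate):
        run = list(run)
        if flag and 2 <= len(run) <= 4:
            names.append(" ".join(run))
    return names

print(find_name_runs('Welcome to the Link Building Partner abc Fooo Foo program!'))
```

As in the code above, a run of five or more capitalized words is dropped entirely rather than split.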

2 Comments

"I want to match possible names from a string". Unlike findall, this can only match the last one...
@JBernardo missed that line, fixed the solution.

for line in data:
    # Returns every capitalized word of two or more characters, one by one
    print(re.findall(r"[A-Z]\w+", line))
