3

I'm trying to extract all the first names and last names (e.g. John Johnson) from a large text (about 20 pages).

I split the text using \. as the separator, and here is my regular expression:

\b([A-Z]{1}[a-z]+\s{1})([A-Z]{1}[a-z]+)\b

Unfortunately, I get back every line of my text instead of only the first and last names:

Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John.... bla bla bla

Could someone help me?

9
  • 3
    What does [nsregularexpression] have to do with Python? Commented Dec 3, 2013 at 14:37
  • 1
    What do you mean by splitting with . as separator? . means any character, and your task seems to be searching, not splitting. What's the input you provide to the regex you mention? Directly using re.search on the pattern and sentence you mention does identify the name as ("Mary ", "Poppins"). Commented Dec 3, 2013 at 14:41
  • 3
    Note that {1} is implicit; \s and \s{1} both match just one character. Commented Dec 3, 2013 at 14:41
  • 2
    What are your rules for defining name and surname? What we must expect them to be like? All names and surnames start with capital or surnames are all capital? How do you plan to separate a name or surname from a word that is the first word after a comma or at the beginning of a sentence (hence starts with a capital)? Commented Dec 3, 2013 at 14:45
  • 5
    I suggest reading kalzumeus.com/2010/06/17/… and then giving up. Commented Dec 3, 2013 at 14:45
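
The points raised in the comments can be checked with a short sketch. It assumes Python 3, and the variable names and the two-sentence `text` are illustrative rather than taken from the question. It suggests that splitting on `\.` only yields sentence fragments (possibly what the asker saw), while searching with the question's pattern does find the name, with or without the `{1}` quantifiers.

```python
import re

sentence = "Suddenly, Mary Poppins flew away with her umbrella"
text = sentence + ". Later in the day, John Johnson did something."

# Splitting on a literal dot returns sentence fragments, not names.
pieces = re.split(r"\.", text)
print(pieces)

# The pattern from the question, written as a raw string.
original = re.compile(r"\b([A-Z]{1}[a-z]+\s{1})([A-Z]{1}[a-z]+)\b")
# The same pattern without the redundant {1} quantifiers.
simplified = re.compile(r"\b([A-Z][a-z]+\s)([A-Z][a-z]+)\b")

original_groups = original.search(sentence).groups()
simplified_groups = simplified.search(sentence).groups()
print(original_groups)
print(simplified_groups)
```

Note that the first group keeps the trailing space because `\s` sits inside the parentheses.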

2 Answers

2

Try

import re

# Raw string: otherwise "\b" is a backspace character, not a word boundary
regex = re.compile(r"\b([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)\b")
string = """Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John Johnson did something."""
regex.findall(string)

The output I got was:

[(u'Mary', u'Poppins'), (u'John', u'Johnson')]
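
A minimal sketch, assuming Python 3, of two details that affect this approach: how `\b` is written in the pattern string, and one limitation of matching on capitalization alone. The names `plain`, `raw` and the "Yesterday" sentence are illustrative and not part of the answer.

```python
import re

# In an ordinary string literal "\b" is the backspace character (0x08);
# in a raw string it stays backslash + "b", which re treats as a word boundary.
plain = "\b([A-Z][a-z]+) ([A-Z][a-z]+)\b"
raw = r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b"

text = """Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John Johnson did something."""

plain_matches = re.findall(plain, text)  # looks for literal backspaces
raw_matches = re.findall(raw, text)      # looks for word boundaries
print(plain_matches)
print(raw_matches)

# Limitation: a capitalized sentence opener followed by a name is
# indistinguishable from "First Last" for this pattern.
false_positive = re.findall(raw, "Yesterday Mary Poppins flew away.")
print(false_positive)
```

The last call pairs "Yesterday" with "Mary" and never reports "Mary Poppins", which is the kind of ambiguity the comments on the question warn about.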

1

I've adapted a regular expression that can handle accented characters and hyphens in compound names:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
r = re.compile('([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)',
           re.UNICODE)
tests = {
    u'Jean Vincent Placé': u'Jean Vincent Placé est un excellent donneur de leçons',
    u'Giovanni Delle Bande Nere': u'In quest\'anno Giovanni Delle Bande Nere ha avuto tre momenti di gloria',
    # Note: the 'BDFL' prefix is captured here, which may not be desired
    u'BDFL Guido Van Rossum': u'Nobody hacks Python like BDFL Guido Van Rossum because he created it'
}
for expected, s in tests.iteritems():
    match = r.search(s)
    assert(match is not None)
    extracted = match.group(0)
    print expected
    print extracted
    assert(expected == match.group(0))
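
For readers on Python 3, the same idea might be sketched as follows. This is an adaptation under stated assumptions, not the answer's original code: `str` patterns are Unicode-aware by default there, so the `re.UNICODE` flag is dropped, and the hyphenated "Jean-Pierre Dupont" case is a hypothetical addition to exercise the dash handling.

```python
import re

# Same pattern as in the answer; in Python 3, \w already matches
# accented letters when the pattern is a str.
name_re = re.compile(r"([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)")

tests = {
    "Jean Vincent Placé": "Jean Vincent Placé est un excellent donneur de leçons",
    "Giovanni Delle Bande Nere": "In quest'anno Giovanni Delle Bande Nere ha avuto tre momenti di gloria",
    "BDFL Guido Van Rossum": "Nobody hacks Python like BDFL Guido Van Rossum because he created it",
    # Hypothetical extra case, not in the original answer
    "Jean-Pierre Dupont": "Hier, Jean-Pierre Dupont est parti",
}

extracted = {}
for expected, sentence in tests.items():
    match = name_re.search(sentence)
    extracted[expected] = match.group(0) if match else None
    print(expected, "->", extracted[expected])
```

Because each name part must start with `[A-Z]`, a part beginning with an accented capital (for example "Émile") would not be matched by this pattern.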
