Extracting first name and last name in Python

Question

I'm trying to extract all the first names AND the last names (e.g. John Johnson) from a large text (about 20 pages).

I split the text using \. as the separator, and here is my regular expression:

\b([A-Z]{1}[a-z]+\s{1})([A-Z]{1}[a-z]+)\b

Unfortunately, I get back whole lines of my text instead of just the first and last names:

Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John.... bla bla bla

Could someone help me?

What do you mean by splitting with . as separator? . means any character, and your task seems to be searching, not splitting. What input do you provide to the regex you mention? Directly using re.search with the pattern and sentence you mention does identify the name as ("Mary ", "Poppins").
– svk, Commented Dec 3, 2013 at 14:41
Note that {1} is implicit; \s and \s{1} both match just one character.
– Martijn Pieters, Commented Dec 3, 2013 at 14:41
What are your rules for defining a name and a surname? What should we expect them to look like? Do all names and surnames start with a capital, or are surnames written in all capitals? How do you plan to tell a name or surname apart from a word that is the first word after a comma or at the beginning of a sentence (and hence starts with a capital)?
– Mp0int, Commented Dec 3, 2013 at 14:45
I suggest reading kalzumeus.com/2010/06/17/… and then giving up.
– Wooble, Commented Dec 3, 2013 at 14:45
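The points raised in these comments can be checked with a short sketch. It assumes Python 3; the variable names and the "Yesterday John Johnson left." sentence are made up for illustration and are not from the question. It shows that dropping {1} changes nothing, and that a capitalised word at the start of a sentence can be mistaken for a first name:

```python
import re

# The question's pattern, and the same pattern without the redundant {1}.
verbose = re.compile(r"\b([A-Z]{1}[a-z]+\s{1})([A-Z]{1}[a-z]+)\b")
concise = re.compile(r"\b([A-Z][a-z]+\s)([A-Z][a-z]+)\b")

sentence = "Suddenly, Mary Poppins flew away with her umbrella"
verbose_groups = verbose.search(sentence).groups()
concise_groups = concise.search(sentence).groups()
print(verbose_groups)  # ('Mary ', 'Poppins')
print(concise_groups)  # ('Mary ', 'Poppins')

# Hypothetical sentence: a capitalised opener directly before a name
# is picked up as if it were the first name.
tricky = "Yesterday John Johnson left."
tricky_matches = concise.findall(tricky)
print(tricky_matches)  # [('Yesterday ', 'John')]
```

Note that the first group keeps the trailing space because \s sits inside the parentheses.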

Jack burridge · Accepted Answer · 2013-12-04 21:42:17Z

Try

import re

# Raw string, so \b is a word boundary rather than a backspace character.
regex = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")
string = """Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John Johnson did something."""
regex.findall(string)

The output I got was:

[(u'Mary', u'Poppins'), (u'John', u'Johnson')]
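One detail worth checking with a pattern like this is how the string literal is written. In an ordinary Python string, "\b" is a backspace character, while in a raw string it reaches the regex engine as a word boundary. A minimal sketch, assuming Python 3 (the names plain and raw are only for illustration):

```python
import re

text = "Suddenly, Mary Poppins flew away with her umbrella"

# Ordinary string: "\b" is turned into a backspace (\x08) before re sees it.
plain = "\b([A-Z][a-z]+) ([A-Z][a-z]+)\b"
# Raw string: the backslash is kept, so re sees the word-boundary token \b.
raw = r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b"

plain_matches = re.findall(plain, text)
raw_matches = re.findall(raw, text)
print(plain_matches)  # []
print(raw_matches)    # [('Mary', 'Poppins')]
```

Under Python 3 the tuples are printed without the u prefix seen in the output above.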

AsTeR · Accepted Answer · 2014-01-26 14:23:27Z

I've adapted a regular expression so that it handles accented characters and hyphens in compound names:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

# One capitalised word followed by one or more capitalised words,
# each joined by whitespace or a hyphen.
r = re.compile(r'([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)',
               re.UNICODE)
tests = {
    u'Jean Vincent Placé': u'Jean Vincent Placé est un excellent donneur de leçons',
    u'Giovanni Delle Bande Nere': u'In quest\'anno Giovanni Delle Bande Nere ha avuto tre momenti di gloria',
    # Here 'BDFL' may not be wanted
    u'BDFL Guido Van Rossum': u'Nobody hacks Python like BDFL Guido Van Rossum because he created it'
}
for expected, s in tests.items():
    match = r.search(s)
    assert match is not None
    extracted = match.group(0)
    print(expected)
    print(extracted)
    assert expected == extracted
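Since the question asks for every name in a long text rather than the first one per sentence, the same pattern can be applied with finditer. This is only a sketch, assuming Python 3; the sample text is made up (Jean-Pierre Dupont is a placeholder name), and the outer capturing group is dropped so that group(0) is the whole name:

```python
import re

# Same pattern as above, without the outer capturing group.
name_pattern = re.compile(r"[A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+")

# Made-up sample text for illustration.
text = ("Nobody hacks Python like Guido Van Rossum, and "
        "Jean-Pierre Dupont agrees with Jean Vincent Placé.")

# finditer walks the whole text and yields every non-overlapping match.
names = [m.group(0) for m in name_pattern.finditer(text)]
print(names)  # ['Guido Van Rossum', 'Jean-Pierre Dupont', 'Jean Vincent Placé']
```

A capitalised word at the start of a sentence that is immediately followed by a name would still be swept into the match, so the results may need filtering afterwards.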
