Python and Regular Expression Substring

Question

I'm attempting to do this:

p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
test_str = u"Russ Middleton and Lisa Murro\nRon Iervolino, Trish and Russ Middleton, and Lisa Middleton \nRon Iervolino, Kelly  and Tom Murro\nRon Iervolino, Trish and Russ Middleton and Lisa Middleton "
subst = u"$1$2 $3"
result = re.sub(p, subst, test_str)

The goal is to get something that both matches all the names and fills in last names when necessary (e.g., Trish and Russ Middleton becomes Trish Middleton and Russ Middleton). In the end, I'm looking for the names that appear together in a single line.

Someone else was kind enough to help me with the regex, and I thought I knew how to write it programmatically in Python (although I'm new to Python). Not being able to get it, I resorted to using the code generated by Regex101 (the code shown above). However, all I get in result is:

u'$1$2 $3 and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3, and $1$2 $3 \n$1$2 $3, $1$2 $3  and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3 and $1$2 $3 '

What am I missing with Python and regular expressions?

Alex Martelli · Accepted Answer · 2015-01-10 03:01:14Z

1

You're not using the right syntax for subst. Python's re module writes backreferences as \1, not $1, so try this instead:

subst = r'\1\2 \3'

However, that exposes a second problem: not all three groups participate in every match.

Specifically:

>>> for x in p.finditer(test_str): print(x.groups())
... 
('Russ Middleton', None, None)
('Lisa Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)
('Ron Iervolino', None, None)
(None, 'Kelly', 'Murro')
('Tom Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)

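Wherever you see None here, that group did not participate in the match. On Python versions before 3.5, trying to interpolate such a group (\1, etc.) in a substitution string raises an "unmatched group" error; from 3.5 on, an unmatched group is replaced with an empty string.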
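As a small illustration of the backreference syntax discussed above, here is a sketch using a toy pattern in which both groups always match (the pattern and the input string are made up for the example, not taken from the question):

```python
import re

# Toy pattern (an assumption for illustration): two words separated by a space.
swap = re.compile(r'(\w+) (\w+)')

# "$1"-style references have no special meaning to Python's re module,
# so they are copied into the result as literal text.
literal = swap.sub('$2 $1', 'hello world')

# Backslash references (or the equivalent \g<n> form) are what re.sub expands.
swapped = swap.sub(r'\2 \1', 'hello world')
swapped_g = swap.sub(r'\g<2> \g<1>', 'hello world')

print(literal)    # $2 $1
print(swapped)    # world hello
print(swapped_g)  # world hello
```

The \g<n> form is handy when a group reference is immediately followed by a digit, since \g<1>0 is unambiguous where \10 is not.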

A function can be more flexible:

>>> def mysub(mo):
...   return '{}{} {}'.format(
...     mo.group(1) or '',
...     mo.group(2) or '',
...     mo.group(3) or '')
... 
>>> result = re.sub(p, mysub, test_str)
>>> result
'Russ Middleton  and Lisa Murro \nRon Iervolino , Trish Middleton and Russ Middleton , and Lisa Middleton  \nRon Iervolino , Kelly Murro  and Tom Murro \nRon Iervolino , Trish Middleton and Russ Middleton  and Lisa Middleton  '

Here, I've coded mysub to do what I suspect you thought a substitution string with group numbers would do for you: use an empty string where a group did not match (i.e., the corresponding mo.group(...) is None).
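The result above carries extra spaces because the format string always emits the separator. One possible variation, sketched here for Python 3 (so the pattern uses r'' rather than ur''), branches on which alternative matched. The name fill_last_name is hypothetical, and the test string is a shortened copy of the question's data with trailing spaces removed:

```python
import re

# Same pattern as in the question, written for Python 3.
NAME = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)'                          # group 1: a full "First Last" name
    r'|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'  # group 2: lone first name, group 3: borrowed last name
)

def fill_last_name(mo):
    """Hypothetical helper: leave full names alone, complete lone first names."""
    if mo.group(1) is not None:
        return mo.group(1)
    return '{} {}'.format(mo.group(2), mo.group(3))

test_str = ("Russ Middleton and Lisa Murro\n"
            "Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton\n"
            "Ron Iervolino, Kelly  and Tom Murro")

result = NAME.sub(fill_last_name, test_str)
print(result)
```

Because full names are returned unchanged, only the lone first names (Trish, Kelly) are modified and the original spacing of the input is preserved.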

edited Jan 10, 2015 at 3:01

answered Jan 10, 2015 at 2:49


4 Comments

Avinash Raj Over a year ago

I tried that, but it raises an unmatched group error.

Alex Martelli Over a year ago

Exactly -- you don't have three matching groups, as I said. Let me show you by editing the answer.

Avinash Raj Over a year ago

But the regex has exactly three capturing groups.

Alex Martelli Over a year ago

Yes, but, where you see None above, the corresponding group has not matched, thus it can't be substituted. You could do it with a function -- let me edit the answer to show how.

Avinash Raj · Accepted Answer · 2015-01-10 03:10:19Z

I suggest a simpler solution.

import re
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton """
m = re.sub(r'(?<=,\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton
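One thing to keep in mind with the pattern above: the lookbehind (?<=,\s) means a first name is only completed when it directly follows a comma and a whitespace character. A minimal sketch of that behavior, using two short inputs chosen for illustration (the name COMMA_PATTERN is hypothetical):

```python
import re

# Pattern from the answer above: only a first name preceded by ", " is completed.
COMMA_PATTERN = re.compile(r'(?<=,\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))')

# Preceded by ", ": the lookbehind succeeds and the last name is filled in.
after_comma = COMMA_PATTERN.sub(r'\1 \2', 'Ron Iervolino, Trish and Russ Middleton')

# At the start of the string there is no comma, so nothing is replaced.
at_line_start = COMMA_PATTERN.sub(r'\1 \2', 'Trish and Russ Middleton')

print(after_comma)    # Ron Iervolino, Trish Middleton and Russ Middleton
print(at_line_start)  # Trish and Russ Middleton
```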


Or, using the third-party regex module, which supports variable-width lookbehind:

import regex
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton 
Trish and Russ Middleton"""
m = regex.sub(r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton 
Trish Middleton and Russ Middleton
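The second snippet relies on the third-party regex module because its negative lookbehind contains \w+, which has no fixed width. As a rough check, assuming the standard library's re module on Python 3, compiling the same pattern is rejected:

```python
import re

# Pattern from the regex-module snippet above; the lookbehind contains \w+,
# so its width is not fixed.
pattern = r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'

try:
    re.compile(pattern)
    stdlib_accepts = True
    message = ''
except re.error as exc:
    # The standard library only supports fixed-width lookbehind.
    stdlib_accepts = False
    message = str(exc)

print(stdlib_accepts)
print(message)
```

If installing regex is not an option, the fixed-width (?<=,\s) variant or a replacement function with the standard re module are the alternatives shown in this thread.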

TheOriginalBMan · Accepted Answer · 2015-01-10 03:15:41Z

Alex: I see what you're saying about the groups. That didn't occur to me. Thanks!

I took a fresh (ish) approach. This appears to be working. Any thoughts on it?

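import re

p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
# s is one input string, e.g. a single item from test_data below
temp_result = p.findall(s)
joiner = " ".join
# each findall result is a 3-tuple with '' for groups that did not match
out = [joiner(words).strip() for words in temp_result]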

Here is some input data:

test_data = ['John Smith, Barri Lieberman, Nancy Drew',
             'Carter Bays and Craig Thomas',
             'John Smith and Carter Bays',
             'Jena Silverman, John Silverman, Tess Silverman, and Dara Silverman',
             'Tess and Dara Silverman',
             'Nancy Drew, John Smith, and Daniel Murphy',
             'Jonny Podell']

I put the code above in a function so I could call it on every item in the list. Calling it on the list above, I get as output (from the function) this:

['John Smith', 'Barri Lieberman', 'Nancy Drew']
['Carter Bays', 'Craig Thomas']
['John Smith', 'Carter Bays']
['Jena Silverman', 'John Silverman', 'Tess Silverman', 'Dara Silverman']
['Tess Silverman', 'Dara Silverman']
['Nancy Drew', 'John Smith', 'Daniel Murphy']
['Jonny Podell']
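For reference, the function described above might look roughly like this in Python 3. The answer does not show its function, so the name extract_names and its exact shape are assumptions; the pattern and the join-and-strip step are taken from the snippet in this answer:

```python
import re

# Pattern from this answer, written with r'' so it also compiles on Python 3.
NAME = re.compile(r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))')

def extract_names(line):
    """Hypothetical wrapper: return the full names found in one input string."""
    # findall yields 3-tuples with '' for groups that did not match;
    # joining with spaces and stripping leaves "First Last".
    return [" ".join(groups).strip() for groups in NAME.findall(line)]

test_data = ['John Smith, Barri Lieberman, Nancy Drew',
             'Carter Bays and Craig Thomas',
             'Tess and Dara Silverman',
             'Jonny Podell']

for line in test_data:
    print(extract_names(line))
```

Joining the tuple works here because at most one of group 1 or groups 2 and 3 is non-empty for any match, so the stripped result is always a single "First Last" string.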
