Using Regex to change the name values format in a dataframe

Question

I'm pretty sure I'm asking the wrong question here, so here goes. I have two dataframes; let's call them df1 and df2.

df1 looks like this:

import pandas as pd

data = {'Employee ID' : [12345, 23456, 34567],
        'Values' : [123168546543154, 13513545435145434, 556423145613],
        'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor' : ['Wendy Davis', 'Albus Dumbledore', 'James Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])

df2 looks similar:

data = {'Employee ID' : [12345, 23456, 34567],
        'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor' : ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James']}
df2 = pd.DataFrame(data, columns=['Employee ID','Employee Name','Department Supervisor'])

My issue is that df1 comes from an Excel file that sometimes has an Employee ID entered and sometimes doesn't. This is where df2 comes in: df2 is a SQL pull from the employee database that I'm using to validate the employee names and supervisor names, to ensure the correct employee ID is used.

Normally I'd be happy to merge the dataframes to get my desired result, but since the supervisor names are in different formats, I'd like to use regex on df1 to turn 'Wendy Davis' into 'Davis, Wendy' (and likewise for the other supervisor names) to match what df2 has. So far I'm coming up empty on how to search for an answer to this. Suggestions?

Scott Boston · Accepted Answer · 2021-07-19 20:15:15Z

1

IIUC, is this what you need?

df1['DS Corrected'] = df1['Department Supervisor'].str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)

Output:

   Employee ID             Values  Employee Name Department Supervisor       DS Corrected
0        12345    123168546543154    Jones, John           Wendy Davis       Davis, Wendy
1        23456  13513545435145434  Potter, Harry      Albus Dumbledore  Dumbledore, Albus
2        34567       556423145613    Watts, Wade        James Halliday    Halliday, James
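With the supervisor names in a common format, the merge the question describes becomes possible. The following is only a sketch under assumptions: the missing Employee ID in the second row of df1, the `lookup` frame, and the `Validated ID` column name are hypothetical and chosen for illustration.

```python
import pandas as pd

# Spreadsheet data; the missing ID in the second row is assumed for illustration.
df1 = pd.DataFrame({
    'Employee ID': [12345, None, 34567],
    'Employee Name': ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
    'Department Supervisor': ['Wendy Davis', 'Albus Dumbledore', 'James Halliday'],
})

# Database pull, as in the question.
df2 = pd.DataFrame({
    'Employee ID': [12345, 23456, 34567],
    'Employee Name': ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
    'Department Supervisor': ['Davis, Wendy', 'Dumbledore, Albus', 'Halliday, James'],
})

# Normalize "First Last" to "Last, First" so both frames use the same format.
df1['DS Corrected'] = df1['Department Supervisor'].str.replace(
    r'(\w+) (\w+)', r'\2, \1', regex=True)

# Rename df2's columns so the merge keys line up and the database ID
# gets its own (hypothetical) column name.
lookup = df2.rename(columns={'Employee ID': 'Validated ID',
                             'Department Supervisor': 'DS Corrected'})

# A left merge keeps every spreadsheet row, matched or not.
validated = df1.merge(lookup, on=['Employee Name', 'DS Corrected'], how='left')
print(validated[['Employee Name', 'DS Corrected', 'Validated ID']])
```

Rows that find no match in the database would come back with a missing `Validated ID`, which makes them easy to pick out for manual review.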

answered Jul 19, 2021 at 20:15

Scott Boston


1 Comment

Scott Boston Over a year ago

@Cwnosky Happy coding. Be safe and stay healthy.

MDR · Accepted Answer · 2021-07-19 21:26:07Z

Albus's full name is Albus Percival Wulfric Brian Dumbledore and James's is James Donovan Halliday (if we're talking about Ready Player One), so consider a dataframe like this:

    Employee ID     Values              Employee Name       Department Supervisor
0   12345           123168546543154     Jones, John         Wendy Davis
1   23456           13513545435145434   Potter, Harry       Albus Percival Wulfric Brian Dumbledore
2   34567           556423145613        Watts, Wade         James Donovan Halliday

So we need to move the last name to the front and keep the remaining names in order:

import pandas as pd

data = {'Employee ID' : [12345, 23456, 34567],
        'Values' : [123168546543154, 13513545435145434, 556423145613],
        'Employee Name' : ['Jones, John', 'Potter, Harry', 'Watts, Wade'],
        'Department Supervisor' : ['Wendy Davis', 'Albus Percival Wulfric Brian Dumbledore', 'James Donovan Halliday']}
df1 = pd.DataFrame(data, columns=['Employee ID','Values','Employee Name','Department Supervisor'])

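def swap_names(text):
    # Split on whitespace into first name, any middle names, and last name.
    # Assumes at least two words; a single-word name raises ValueError.
    first, *middle, last = text.split()
    # Put the last name first, then the other names in their original order.
    return last + ', ' + ' '.join([first, *middle])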

df1['Department Supervisor'] = [swap_names(row) for row in df1['Department Supervisor']]

print(df1)

Outputs:

    Employee ID     Values              Employee Name   Department Supervisor
0   12345           123168546543154     Jones, John     Davis, Wendy
1   23456           13513545435145434   Potter, Harry   Dumbledore, Albus Percival Wulfric Brian
2   34567           556423145613        Watts, Wade     Halliday, James Donovan
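Since the question asked for regex, the same "last word to the front" rule can probably also be written as a single `str.replace` call. This is a sketch rather than part of the answer above; the pattern and the `names` and `swapped` variables are my own assumptions.

```python
import pandas as pd

names = pd.Series(['Wendy Davis',
                   'Albus Percival Wulfric Brian Dumbledore',
                   'James Donovan Halliday'])

# Group 1: everything before the final word (greedy).
# Group 2: the final run of non-space characters, treated as the last name.
swapped = names.str.replace(r'^(.+)\s+(\S+)$', r'\2, \1', regex=True)

print(swapped.tolist())
# ['Davis, Wendy', 'Dumbledore, Albus Percival Wulfric Brian', 'Halliday, James Donovan']
```

One difference in behavior: a single-word name does not match the pattern and is left unchanged, whereas the unpacking in `swap_names` raises a `ValueError` for it.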

MDR · Accepted Answer · 2021-07-19 20:44:56Z

0

Maybe...

df1['Department Supervisor'] = [', '.join(x.split()[::-1]) for x in df1['Department Supervisor']]

Outputs:

    Employee ID     Values              Employee Name       Department Supervisor
0   12345           123168546543154     Jones, John         Davis, Wendy
1   23456           13513545435145434   Potter, Harry       Dumbledore, Albus
2   34567           556423145613        Watts, Wade         Halliday, James
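Note that reversing the whole split only gives the intended result for two-word names. With middle names, every word gets its own comma. A possible variation, assuming the last word is always the surname, is to split once from the right; the helper names below are hypothetical.

```python
def reverse_all(name):
    # The one-liner above: reverse every word and join with commas.
    return ', '.join(name.split()[::-1])

def last_first(name):
    # Split only once, from the right, so middle names stay with the first name.
    parts = name.rsplit(maxsplit=1)
    return ', '.join(reversed(parts))

print(reverse_all('James Donovan Halliday'))  # Halliday, Donovan, James
print(last_first('James Donovan Halliday'))   # Halliday, James Donovan
print(last_first('Wendy Davis'))              # Davis, Wendy
```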

answered Jul 19, 2021 at 20:44

MDR
