Using regex matched groups in pandas dataframe replace function

Question

I'm just learning python/pandas and like how powerful and concise it is.

During data cleaning I want to use replace on a column in a dataframe with regex but I want to reinsert parts of the match (groups).

Simple Example: lastname, firstname -> firstname lastname

I tried something like the following (actual case is more complex so excuse the simple regex):

df['Col1'].replace({'([A-Za-z])+, ([A-Za-z]+)' : '\2 \1'}, inplace=True, regex=True)

However, this results in empty values. The match part works as expected, but the value part doesn't. I guess this could be achieved by some splitting and merging, but I am looking for a general answer as to whether the regex group can be used in replace.

or df['Col1'].replace({'([A-Za-z]+), ([A-Za-z]+)' : '\\2 \\1'}, inplace=True, regex=True).
– Abdou, Commented Jan 4, 2017 at 21:00
Really great! Just learning python as well, so please excuse the newbie mistake. Additional question: do both ways broadcast, i.e. are they both fast, the one via .str and the one using replace() directly?
– Peter D, Commented Jan 4, 2017 at 21:07
@PeterD, df.column.str.replace() should be a bit faster than df.column.replace({}), but the second one allows you to make several replacements in one go.
– MaxU - stand with Ukraine, Commented Jan 4, 2017 at 21:20

Community · Accepted Answer · 2017-05-23 12:09:48Z

49

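I think there are a few issues with your regexes. In the first group the + sits outside the parentheses, so ([A-Za-z])+ captures only the last letter of the last name rather than the whole name.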

As @Abdou just said, use either '\\2 \\1' or, better, the raw string r'\2 \1'. In an ordinary string literal, '\1' is the single character with ASCII code 1, not a backreference.
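To see the difference concretely, here is a small check in plain Python (no pandas needed). The variable names are only illustrative.

```python
import re

plain = '\2 \1'   # ordinary literal: octal escapes, i.e. chr(2) + ' ' + chr(1)
raw = r'\2 \1'    # raw literal: the backslashes survive, so re sees backreferences

print(len(plain), len(raw))    # 3 5
print(plain == '\x02 \x01')    # True
print(raw == '\\2 \\1')        # True: the escaped form is the same string

# Only the raw (or double-escaped) form swaps the two captured groups.
swapped = re.sub(r'(\w+),\s+(\w+)', raw, 'John, Doe')
print(swapped)                 # Doe John
```

With the ordinary literal, the replacement consists of two control characters and a space, which is why the column appeared to come back empty.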

Your solution should work if you use correct regexes:

In [193]: df
Out[193]:
              name
0        John, Doe
1  Max, Mustermann

In [194]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1'}, regex=True)
Out[194]:
0          Doe John
1    Mustermann Max
Name: name, dtype: object

In [195]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1', 'Max':'Fritz'}, regex=True)
Out[195]:
0            Doe John
1    Mustermann Fritz
Name: name, dtype: object
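For anyone who wants to run this outside IPython, here is a self-contained sketch of the same idea. It rebuilds the example frame and compares `Series.replace` with the `.str.replace` route mentioned in the comments. It assigns the result back to a new column instead of using `inplace=True` on a column selection, which is an assumption about what is safest across pandas versions rather than part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John, Doe', 'Max, Mustermann']})

# Series.replace with regex=True takes a {pattern: replacement} dict;
# raw strings keep \1 and \2 as backreferences.
swapped = df['name'].replace({r'(\w+),\s+(\w+)': r'\2 \1'}, regex=True)
print(swapped.tolist())      # ['Doe John', 'Mustermann Max']

# The same substitution through the .str accessor, one pattern at a time.
# regex=True is passed explicitly so the pattern is not treated as a literal.
swapped_str = df['name'].str.replace(r'(\w+),\s+(\w+)', r'\2 \1', regex=True)
print(swapped_str.tolist())  # ['Doe John', 'Mustermann Max']

# Store the result by assignment rather than relying on inplace=True.
df['swapped'] = swapped
```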

edited May 23, 2017 at 12:09


answered Jan 4, 2017 at 20:59

MaxU - stand with Ukraine


1 Comment

Peter D Over a year ago

Thanks, especially for the nice explanation of how Python regex works. Most examples I have seen are so simple that they can omit the r syntax without problems, it seems.

piRSquared · Accepted Answer · 2017-01-04 20:53:39Z

16

setup

df = pd.DataFrame(dict(name=['Smith, Sean']))
print(df)

          name
0  Smith, Sean

using replace

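# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal by default
df.name.str.replace(r'(\w+),\s*(\w+)', r'\2 \1', regex=True)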

0    Sean Smith
Name: name, dtype: object
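When the replacement needs more logic than a backreference string can express, `str.replace` also accepts a callable that receives each match object. The sketch below uses named groups; the helper name `first_then_last` and the group names are illustrative, not part of the original answer.

```python
import pandas as pd

names = pd.Series(['Smith, Sean'])

def first_then_last(match):
    # match is a regular re.Match object, so named groups are available
    return f"{match.group('First')} {match.group('Last')}"

result = names.str.replace(
    r'(?P<Last>\w+),\s*(?P<First>\w+)',
    first_then_last,
    regex=True,  # a callable replacement only works in regex mode
)
print(result.tolist())  # ['Sean Smith']
```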

using extract
split to two columns

df.name.str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)

    Last First
0  Smith  Sean
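The extracted columns can then be recombined in any order. This is a minimal sketch; the `full_name` column is a hypothetical name chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Smith, Sean']})

# Named groups become the column names of the extracted frame.
parts = df['name'].str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)

# Rows that do not match the pattern yield NaN in both columns,
# so the concatenation below is NaN for them as well.
df['full_name'] = parts['First'] + ' ' + parts['Last']
print(df['full_name'].tolist())  # ['Sean Smith']
```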

answered Jan 4, 2017 at 20:53

piRSquared


1 Comment

Pierre D Over a year ago

the way to get the named groups is what I was looking for.
