I'm just learning python/pandas and like how powerful and concise it is.

During data cleaning I want to use replace with a regex on a column of a DataFrame, and I want to reinsert parts of the match (capture groups) in the replacement.

Simple Example: lastname, firstname -> firstname lastname

I tried something like the following (actual case is more complex so excuse the simple regex):

df['Col1'].replace({'([A-Za-z])+, ([A-Za-z]+)' : '\2 \1'}, inplace=True, regex=True)

However, this results in empty values. The match part works as expected, but the value part doesn't. I guess this could be achieved by some splitting and merging, but I am looking for a general answer as to whether the regex group can be used in replace.

4 Comments
  • Please share some data for testing. Commented Jan 4, 2017 at 20:53
  • or df['Col1'].replace({'([A-Za-z]+), ([A-Za-z]+)' : '\\2 \\1'}, inplace=True, regex=True). Commented Jan 4, 2017 at 21:00
  • Really great! Just learning Python as well, so please excuse the newbie mistake. Additional question: Do both ways broadcast, i.e. are they both fast, the one via .str and the one using replace() directly? Commented Jan 4, 2017 at 21:07
  • @PeterD, df.column.str.replace() should be a bit faster than df.column.replace({}), but the second one allows you to make several replacements in one go. Commented Jan 4, 2017 at 21:20

2 Answers

I think there are a few issues with your regexes.

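As @Abdou just said, use either '\\2 \\1' or, better, the raw string r'\2 \1'. In a normal (non-raw) string literal, '\1' is the single character with ASCII code 1, not a backslash followed by 1, so the regex engine never sees a group reference.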
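
To make the difference concrete, here is a minimal sketch in plain Python (no pandas needed). The variable names are only for illustration.

```python
# Non-raw literal: \2 and \1 are octal escapes, so each becomes ONE control character.
plain = '\2 \1'

# Raw literal: the backslashes are kept, so the regex engine sees group references.
raw = r'\2 \1'

# Doubling the backslashes in a normal literal gives the same string as the raw one.
escaped = '\\2 \\1'

print([ord(c) for c in plain])  # [2, 32, 1]
print(len(plain), len(raw))     # 3 5
print(raw == escaped)           # True
```

This would explain why the values in the question did not come out as expected: the replacement text was two control characters around a space, not references to groups 2 and 1.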

Your solution should work if you use correct regexes:

In [193]: df
Out[193]:
              name
0        John, Doe
1  Max, Mustermann

In [194]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1'}, regex=True)
Out[194]:
0          Doe John
1    Mustermann Max
Name: name, dtype: object

In [195]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1', 'Max':'Fritz'}, regex=True)
Out[195]:
0            Doe John
1    Mustermann Fritz
Name: name, dtype: object
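
One more detail that seems worth spelling out: the pattern in the question has the + outside the first group, ([A-Za-z])+, so the group is repeated and only its last repetition (a single letter) is kept. A small sketch with the standard re module, using a made-up sample name:

```python
import re

name = 'Mustermann, Max'  # hypothetical sample value

# Quantifier OUTSIDE the group: group 1 ends up holding only the last letter matched.
outside = re.sub(r'([A-Za-z])+, ([A-Za-z]+)', r'\2 \1', name)

# Quantifier INSIDE the group: group 1 captures the whole word.
inside = re.sub(r'([A-Za-z]+), ([A-Za-z]+)', r'\2 \1', name)

print(outside)  # Max n
print(inside)   # Max Mustermann
```

So even with a raw replacement string, the original pattern would have kept just one letter of the last name.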

1 Comment

Thanks, especially for the nice explanation of how Python regex works. Most examples I have seen are so simple that they can omit the r syntax without problems, it seems.

setup

import pandas as pd

df = pd.DataFrame(dict(name=['Smith, Sean']))
print(df)

          name
0  Smith, Sean

using replace

df.name.str.replace(r'(\w+),\s*(\w+)', r'\2 \1', regex=True)  # pandas 2.0+ defaults to regex=False

0    Sean Smith
Name: name, dtype: object

using extract to split into two columns

df.name.str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)

    Last First
0  Smith  Sean
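
Named groups can also be used in the replacement itself via \g<name>. A minimal sketch with the standard re module is below. I am assuming the same replacement string carries over to .str.replace(..., regex=True) on an ordinary object-dtype column, where pandas delegates to Python's re engine, but I have not checked every string backend.

```python
import re

# Same pattern as in the extract example, compiled once for reuse.
pattern = re.compile(r'(?P<Last>\w+),\s*(?P<First>\w+)')

# \g<name> in the replacement refers to a named group.
swapped = pattern.sub(r'\g<First> \g<Last>', 'Smith, Sean')

# groupdict() returns the same pieces that extract puts into columns.
parts = pattern.match('Smith, Sean').groupdict()

print(swapped)  # Sean Smith
print(parts)    # {'Last': 'Smith', 'First': 'Sean'}
```

For patterns with several groups, names tend to be easier to read than \1 and \2.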

1 Comment

The way to get the named groups is what I was looking for.
