pandas string replace function with regex gives wrong result

Question

dfF:

    Sample  AlmostFinal
    1       KOPLA234
    1       KOPLA234
    2       RWPLB253
    3       MMPLA415
    3       MMPLA415

I need to replace KOPL, RWP and MM with KOLPOL, and the last letter (A/B) should stay. So the result should be:

    Sample  AlmostFinal  Final
    1       KOPLA234     KOLPOLA234
    1       KOPLA234     KOLPOLA234
    2       RWPLB253     KOLPOLB253
    3       MMPLA415     KOLPOLA415
    3       MMPLA415     KOLPOLA415

I tried to do it by replace:

    dfF['Final'] = (dfF['AlmostFinal'].replace({'KOPL':'KOLPOL'}, regex = True))
    dfF['Final'] = (dfF['AlmostFinal'].replace({'RWP':'KOLPOL'}, regex = True))
    dfF['Final'] = (dfF['AlmostFinal'].replace({'MMPL':'KOLPOL'}, regex = True))

If I comment out the 2nd and 3rd lines, the replacement for KOPL works.

When I comment out the 1st and 3rd lines, the replacement for RWP works.

But when I uncomment all three lines and run them, only the last one works. Why? In another script I have similar code, and there every line works.

How does replacing 'MM' in 'MMPLA415' with 'KOLPOL' make it 'KOLPOLA415'?
– DYZ, Commented Jun 28, 2019 at 6:17
The reason your code does not work is that the last line overwrites the results from the first two lines. Can you please explain whether you're trying to replace all strings beginning with MM up to the last char, or specifically MMPL, or something else?
– cs95, Commented Jun 28, 2019 at 6:20
Still wrong. Replacing RWP with KOLPOL in RWPLB253 makes it KOLPOLLB253, not KOLPOLB253.
– DYZ, Commented Jun 28, 2019 at 6:21

cs95 · Accepted Answer · 2019-06-28 06:23:12Z

You can use a single replace call with regex=True:

    df['Final'] = df['AlmostFinal'].replace(
        [r'KOPL', r'RWP.*?(?=A|B)', r'MM.*(?=A|B)'], 'KOLPOL', regex=True)
    df

       Sample AlmostFinal       Final
    0       1    KOPLA234  KOLPOLA234
    1       1    KOPLA234  KOLPOLA234
    2       2    RWPLB253  KOLPOLB253
    3       3    MMPLA415  KOLPOLA415
    4       3    MMPLA415  KOLPOLA415

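We want to handle a varying number of characters between each prefix and the final letter (A or B), so a regex with a lookahead is useful here: the lookahead checks that A or B follows without including it in the match, so the letter is not replaced.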
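To see what the lookahead contributes, here is a minimal sketch using Python's standard `re` module rather than pandas. The first value comes from the question; `RWPXYZB253` is a made-up value, added only to show a longer gap before the final letter.

```python
import re

# Lazy match from the prefix up to (but not including) the first A or B.
pattern = re.compile(r'RWP.*?(?=A|B)')

# The lookahead asserts that A or B follows without consuming it,
# so the letter survives the substitution.
with_lookahead = pattern.sub('KOLPOL', 'RWPLB253')

# Without the lookahead, the letter is part of the match and is lost.
without_lookahead = re.sub(r'RWP.*?[AB]', 'KOLPOL', 'RWPLB253')

# Hypothetical value with extra characters between prefix and letter.
longer_gap = pattern.sub('KOLPOL', 'RWPXYZB253')

print(with_lookahead)     # KOLPOLB253
print(without_lookahead)  # KOLPOL253
print(longer_gap)         # KOLPOLB253
```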

Further generalisation is possible: define your prefixes, then append the lookahead to each one with a list comprehension.

    pat = ['KOPL', 'RWP', 'MM']
    df['Final'] = df['AlmostFinal'].replace(
        [rf'{p}.*(?=A|B)' for p in pat], 'KOLPOL', regex=True)  # need python3.6+
    df

       Sample AlmostFinal       Final
    0       1    KOPLA234  KOLPOLA234
    1       1    KOPLA234  KOLPOLA234
    2       2    RWPLB253  KOLPOLB253
    3       3    MMPLA415  KOLPOLA415
    4       3    MMPLA415  KOLPOLA415
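One caveat worth checking on your own data: the generalised pattern uses a greedy `.*`, which runs up to the last A or B in the string rather than the first. With the sample data that makes no difference, because only digits follow the letter. The sketch below uses a made-up value, `MMPLA41B`, purely to illustrate how the greedy and lazy forms would diverge if a second A or B appeared later.

```python
import re

# Hypothetical ID (not from the question) with a second A/B near the end.
value = 'MMPLA41B'

# Greedy: '.*' extends to the LAST position that is followed by A or B.
greedy = re.sub(r'MM.*(?=A|B)', 'KOLPOL', value)

# Lazy: '.*?' stops at the FIRST position that is followed by A or B.
lazy = re.sub(r'MM.*?(?=A|B)', 'KOLPOL', value)

print(greedy)  # KOLPOLB
print(lazy)    # KOLPOLA41B
```

If your real values can contain more than one A or B, the lazy form is probably the one you want.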

If you only want to replace specific, fixed substrings, the solution is a little simpler.

    pat = ['KOPL', 'RWPL', 'MMPL']
    df['AlmostFinal'].replace(pat, 'KOLPOL', regex=True)

    0    KOLPOLA234
    1    KOLPOLA234
    2    KOLPOLB253
    3    KOLPOLA415
    4    KOLPOLA415
    Name: AlmostFinal, dtype: object

No other modifications required. For more general replacements, see above.

edited Jun 28, 2019 at 6:23

answered Jun 28, 2019 at 6:11

cs95

1 Comment

martin Over a year ago

Thank U very much for examples and explanation. That's very useful! :)

Masklinn · Accepted Answer · 2019-06-28 06:15:54Z

If I comment out the 2nd and 3rd lines, the replacement for KOPL works. When I comment out the 1st and 3rd lines, the replacement for RWP works. But when I uncomment all three lines and run them, only the last one works. Why?

Because replace returns a new object instead of modifying its input, and each of your calls starts again from the original AlmostFinal column, every assignment to Final throws away the result of the previous one.
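A minimal sketch of that behaviour, using the column names from the question and a three-row frame (one row per distinct value) as a stand-in for the real data:

```python
import pandas as pd

dfF = pd.DataFrame({'AlmostFinal': ['KOPLA234', 'RWPLB253', 'MMPLA415']})

# Each call reads the untouched 'AlmostFinal' column and returns a new Series.
first = dfF['AlmostFinal'].replace({'KOPL': 'KOLPOL'}, regex=True)
last = dfF['AlmostFinal'].replace({'MMPL': 'KOLPOL'}, regex=True)

# Assigning both to the same column keeps only the most recent result.
dfF['Final'] = first
dfF['Final'] = last

print(dfF['AlmostFinal'].tolist())  # source column is unchanged
print(dfF['Final'].tolist())        # only the MMPL replacement is visible
```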

Either do all the replacements simultaneously, e.g. with a single regex or a single dict with multiple keys (I'm not sure why you'd use a dict for a single value here):

    {
        'KOPL':'KOLPOL',
        'RWP':'KOLPOL',
        'MMP':'KOLPOL',
    }

or perform each replace on the result of the previous one (either chain replace, or the second and third should work on df['Final']).
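A sketch of the chained variant, where each replace works on the result of the previous one. It assumes the full four-letter prefixes KOPL, RWPL and MMPL, which is what the expected output in the question implies (replacing only RWP would leave an extra L behind). The three-row frame is a stand-in for the real data.

```python
import pandas as pd

dfF = pd.DataFrame({'AlmostFinal': ['KOPLA234', 'RWPLB253', 'MMPLA415']})

# Each .replace() receives the Series returned by the previous call,
# so no intermediate result is discarded.
dfF['Final'] = (
    dfF['AlmostFinal']
    .replace({'KOPL': 'KOLPOL'}, regex=True)
    .replace({'RWPL': 'KOLPOL'}, regex=True)
    .replace({'MMPL': 'KOLPOL'}, regex=True)
)

print(dfF['Final'].tolist())  # ['KOLPOLA234', 'KOLPOLB253', 'KOLPOLA415']
```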

answered Jun 28, 2019 at 6:15

Masklinn

2 Comments

cs95 Over a year ago

Does not work for the same reason as mentioned here. It is not guaranteed what follows the substrings listed.

DYZ Over a year ago

@cs95 There is an inconsistency in the OP. The description of the operation does not match the expected results.

DYZ · Accepted Answer · 2019-06-28 06:22:29Z

You should execute one assignment, not three. Otherwise, each subsequent assignment overwrites the result of the previous one.

    dfF['Final'] = dfF['AlmostFinal']\
                   .replace({'KOP|RWP|MMP': 'KOLPO'}, regex = True)
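Note that the replacement here is KOLPO rather than KOLPOL: the pattern matches only the first three letters, so the L that follows them in every sample value is left in place and completes the word. A small trace with the standard `re` module (not pandas) shows this for the three distinct values from the question; the `results` dict is just a helper for this illustration.

```python
import re

pattern = re.compile(r'KOP|RWP|MMP')
results = {}

for value in ['KOPLA234', 'RWPLB253', 'MMPLA415']:
    match = pattern.search(value)
    remainder = value[match.end():]           # e.g. 'LA234' for 'KOPLA234'
    results[value] = pattern.sub('KOLPO', value)
    print(f"{match.group()} -> KOLPO + {remainder} = {results[value]}")
```

This relies on every value having an L directly after the three-letter prefix, which holds for the sample data but is an assumption about anything beyond it.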

edited Jun 28, 2019 at 6:22

answered Jun 28, 2019 at 6:12

DYZ

2 Comments

DYZ Over a year ago

@cs95 It does produce the expected output after the OP's edits.

cs95 Over a year ago

It's confusing, but I guess we'll have to wait for them to say :)
