Nested regex replacement in a loop with Pandas

Question

I am trying to do a nested regex replacement in pandas, and I am having a hard time capturing all the nested components in the regex.

For example, I would like to remove all instances of 'ba' and 'ba ca' from column A of a dataframe. However, only 'ba' is removed; the 'ca' part of 'ba ca' is left behind, I think because 'ba' is nested within 'ba ca'.

import pandas as pd

df = pd.DataFrame({'A': ['ba t', 'ba ca t', 'foo', 'ba it'],'B': ['abc','abc', 'bar', 'xyz']})

replace_list=['ba','ba ca']

for i in replace_list:
    df=df.replace({'A': r'^({})'.format(i)}, {'A': ''}, regex=True)
df

I would expect column A at row index 1 to be t, not ca t. Any help is highly appreciated. This is the output I currently get:

       A    B
0      t  abc
1   ca t  abc
2    foo  bar
3     it  xyz

Chris · Accepted Answer · 2019-06-21 04:54:26Z

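Combine replace_list into a single regex. Pass regex=True explicitly, since str.replace treats the pattern as a literal string by default in pandas 2.0 and later:

df['A'].str.replace('|'.join(replace_list[::-1]), '', regex=True).str.strip()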

Output:

0      t
1      t
2    foo
3     it
Name: A, dtype: object

Note that replace_list is reversed so that the pattern tries ba ca before ba. Otherwise ba matches first and the ca part is left behind.
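To illustrate why the order of the alternatives matters, here is a minimal sketch using Python's standard re module on the single string 'ba ca t' (an illustrative stand-in for one cell of column A; pandas applies the same regex semantics to each element). Regex alternation takes the first alternative that matches at a position, not the longest one.

```python
import re

text = 'ba ca t'

# Shorter alternative first: 'ba' matches at position 0,
# so 'ba ca' is never tried there and ' ca t' is left over.
short_first = re.sub('ba|ba ca', '', text).strip()

# Longer alternative first: 'ba ca' is consumed as a whole.
long_first = re.sub('ba ca|ba', '', text).strip()

print(short_first)  # ca t
print(long_first)   # t
```

The loop in the question fails for a related reason: the first pass removes ba, so by the time ^(ba ca) is tried the string no longer starts with ba ca.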

6 Comments

cs95 Over a year ago

The "it first checks ba ca and then ba, thus not leaving the ca part." should have come first.

Sveta Over a year ago

Thank you. I tried the code below and I am still getting the same wrong result. What am I missing?

df = pd.DataFrame({'A': ['ba t', 'ba ca t', 'foo', 'ba it'],'B': ['abc','abc', 'bar', 'xyz']})
replace_list=['ba ca','ba']
df['A'].str.replace('|'.join(replace_list[::-1]), '').str.strip()

Chris Over a year ago

@cs95 Are you suggesting that I should move the note part to the top of the answer?

Himmat Over a year ago

You can sort your replace list in descending order of string length. That way 'ba ca' is checked first and then 'ba', as stated in the answer above. This also keeps working if you have more items in your replace list.

Chris Over a year ago

@Sveta Change replace_list[::-1] to replace_list, since ba ca comes first now. Reversing the list is unnecessary :)
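The length-ordering suggestion from the comments can be sketched as follows, so the result no longer depends on how replace_list happens to be ordered. This is one possible way to write it, not the answerer's exact code: the names ordered, pattern and cleaned are illustrative, re.escape is added on the assumption that entries are meant as literal text, and the leading ^ is carried over from the loop in the question (drop it to remove matches anywhere in the string).

```python
import re
import pandas as pd

df = pd.DataFrame({'A': ['ba t', 'ba ca t', 'foo', 'ba it'],
                   'B': ['abc', 'abc', 'bar', 'xyz']})
replace_list = ['ba', 'ba ca']

# Longest entries first, so 'ba ca' is tried before its prefix 'ba'.
ordered = sorted(replace_list, key=len, reverse=True)

# re.escape guards against entries containing regex metacharacters;
# '^' restricts removal to the start of the string, as in the question's loop.
pattern = '^(?:' + '|'.join(re.escape(s) for s in ordered) + ')'

cleaned = df['A'].str.replace(pattern, '', regex=True).str.strip()
print(cleaned.tolist())  # ['t', 't', 'foo', 'it']
```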
