Pandas str.replace() with regex

Question

Say I have this dataframe:

import pandas as pd

df = pd.DataFrame({'Col': ['DDJFHGBC', 'AWDGUYABC']})

And I want to replace everything ending with ABC with ABC and everything ending with BC (except the ABC-cases) with BC. The output would look like:

    Col
0   BC
1   ABC

How can I achieve this using regular expressions? I've tried things like:

df.Col.str.replace(r'\w*BC\b', 'BC', regex=True)
df.Col.str.replace(r'\w*ABC\b', 'ABC', regex=True)

But obviously these two lines are conflicting and I would end up with just BC in whichever order I use them.

I fail to understand the purpose. Maybe add more examples so we can see the logic behind what you want.
– sander, Commented May 7, 2020 at 8:59
To replace everything ending with ABC with ABC and everything ending with BC (except the ABC-cases) with BC.
– CHRD, Commented May 7, 2020 at 8:59
Perhaps match A?BC$ or match \w*?(A?BC)\b and replace with group 1: regex101.com/r/fMcfHI/1
– The fourth bird, Commented May 7, 2020 at 9:00
I realize it should be sufficient to replace everything before BC or ABC with "". How can I do that?
– CHRD, Commented May 7, 2020 at 9:08

The fourth bird · Accepted Answer · 2020-05-07 09:22:23Z

4

You could match as few word characters as possible using \w*?, then capture in group 1 an optional A followed by BC, written (A?BC), followed by a word boundary.

\w*?(A?BC)\b


In the replacement, use group 1:

df.Col.str.replace(r'\w*?(A?BC)\b', r'\1', regex=True)
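
As a minimal sketch of the approach above, assuming pandas is installed, the pattern can be checked against the sample frame from the question. The rows 'XYZ' and '1ABC 2ABC' are not in the question; they are added here only to show what happens with a non-matching value and with several words in one string. `regex=True` is passed explicitly because `Series.str.replace` treats the pattern as a literal string by default from pandas 2.0 onward.

```python
import pandas as pd

# The first two values come from the question; the last two are
# assumptions added purely for illustration.
df = pd.DataFrame({'Col': ['DDJFHGBC', 'AWDGUYABC', 'XYZ', '1ABC 2ABC']})

# \w*? lazily consumes word characters, (A?BC) captures the suffix,
# and \b requires the match to end at a word boundary.
result = df['Col'].str.replace(r'\w*?(A?BC)\b', r'\1', regex=True)

print(result.tolist())
# ['BC', 'ABC', 'XYZ', 'ABC ABC']
```

Because the pattern works word by word, a value with no BC ending is left unchanged, and each word in a multi-word string is reduced separately.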

answered May 7, 2020 at 9:22

The fourth bird


Wiktor Stribiżew · Accepted Answer · 2020-05-07 09:29:04Z

2

You may use a replace solution like:

df['Col'].str.replace(r'(?s)^.*?(A?BC)$', r'\1', regex=True)
# 0     BC
# 1    ABC

Here, (?s)^.*?(A?BC)$ matches:

(?s) - a . will match any char including line break chars
^ - start of string
.*? - any 0+ chars, as few as possible
(A?BC) - Group 1 (referred to with \1 from the replacement pattern): an optional A and then BC
$ - end of string.
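
To see the role of each piece listed above, here is a small sketch that uses Python's standard `re` module directly. That choice is an assumption made to keep the example free of dependencies; the pandas call in the answer is expected to apply the same substitution to each element. All sample strings other than the two from the question are made up for illustration.

```python
import re

# Pattern from the answer: anchored at both ends, with (?s) so that
# the dot also matches line break characters.
pattern = re.compile(r'(?s)^.*?(A?BC)$')

samples = [
    'DDJFHGBC',     # from the question: ends with BC only
    'AWDGUYABC',    # from the question: ends with ABC
    '1ABC 2ABC',    # hypothetical: the whole string collapses to the final ABC
    'line1\nXABC',  # hypothetical: (?s) lets .*? cross the line break
    'XYZ',          # hypothetical: no match, so the string is left unchanged
]

results = [pattern.sub(r'\1', s) for s in samples]

print(results)
# ['BC', 'ABC', 'ABC', 'ABC', 'XYZ']
```

Because the pattern is anchored with ^ and $, it treats the whole string as one unit rather than working word by word.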

edited May 7, 2020 at 9:29

answered May 7, 2020 at 9:16

Wiktor Stribiżew


8 Comments

CHRD Over a year ago

This is not the best solution as it would alter strings that end with neither BC nor ABC.

Wiktor Stribiżew Over a year ago

@CHRD Which one? I have just finished writing the answer. BTW, both "work" for your data, in your question, there is no indication what to do with no-matches.

CHRD Over a year ago

I meant the first one. Thank you!

Wiktor Stribiżew Over a year ago

@CHRD What if you have 1ABC 2ABC? What will be the expected result?

CHRD Over a year ago

They would still end up as ABC.


Brian · Accepted Answer · 2020-05-07 09:11:41Z

1

How about this?

df.Col.str.replace(r'\w*ABC\b', 'ABC_', regex=True).str.replace(r'\w*BC\b', 'BC', regex=True).str.replace(r'\w*ABC_\b', 'ABC', regex=True)

It first replaces \w*ABC\b with ABC_. ABC_ won't be affected by replace(r'\w*BC\b', 'BC').

Then it replaces ABC_ with ABC, removing the temporary placeholder again.
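
The three steps can be run one at a time to see the intermediate values. This is a sketch assuming pandas is installed; the names `step1`, `step2` and `step3` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col': ['DDJFHGBC', 'AWDGUYABC']})

# Step 1: protect the ABC endings with a temporary trailing underscore.
step1 = df['Col'].str.replace(r'\w*ABC\b', 'ABC_', regex=True)

# Step 2: collapse the remaining BC endings. In "ABC_" the "BC" is
# followed by "_", a word character, so \b does not match there.
step2 = step1.str.replace(r'\w*BC\b', 'BC', regex=True)

# Step 3: remove the placeholder again.
step3 = step2.str.replace(r'\w*ABC_\b', 'ABC', regex=True)

print(step1.tolist())  # ['DDJFHGBC', 'ABC_']
print(step2.tolist())  # ['BC', 'ABC_']
print(step3.tolist())  # ['BC', 'ABC']
```

Note that this relies on the placeholder not occurring in the real data: a value that already ends in ABC_ would also be rewritten by the last step.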

answered May 7, 2020 at 9:11

Brian


1 Comment

CHRD Over a year ago

This works. But how about replacing everything before BC or ABC with ""? That would only require one .replace().
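
One way to read the idea in this comment, removing everything before the final BC or ABC in a single replacement, is a lazy prefix followed by a lookahead. The sketch below uses Python's standard `re` module rather than pandas, which is an assumption made so the example does not depend on how a particular pandas string backend handles lookaheads. The name `prefix` and the value 'XYZ' are made up for illustration.

```python
import re

# Lazily match from the start of the string up to the point where only
# "BC" or "ABC" remains. The lookahead (?=A?BC$) checks that suffix
# without consuming it, so only the prefix is replaced.
prefix = re.compile(r'^.*?(?=A?BC$)')

values = ['DDJFHGBC', 'AWDGUYABC', 'XYZ']  # 'XYZ' is a made-up non-matching value
stripped = [prefix.sub('', v) for v in values]

print(stripped)
# ['BC', 'ABC', 'XYZ']
```

Because the optional A is tried first at each position, a string ending in ABC keeps its A, and a string with neither ending is left untouched.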
