Removing different string patterns from Pandas column

Question

I have the following column which consists of email subject headers:

Subject
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations

What I want to achieve is:

Subject
Transport enquiry
0001 || Copy of enquiry
Model - Jan
Calculations

For this I am using the code below, but it only takes the first regular expression into account and ignores the rest:

def clean_subject_prelim(text):
     text = re.sub(r'^EXT \|\| $' , '' , text)
     text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
     text = re.sub(r'EXT \|\| FW:', '' , text)
     text = re.sub(r'^SV: \[EXTERNAL]$' , '' , text)
     return text
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))

Why is this not working? What am I missing here?

Try df['subject_clean'] = df['Subject'].str.replace(r'(?m)^(?:EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)?|SV:\s*\[EXTERNAL])\s*', '', regex=True)
– Wiktor Stribiżew, Commented Dec 25, 2022 at 14:28
Worked really well, but can you explain how you came to this solution? I have some other patterns that I would also like to replace with '', so how should I incorporate them into the pattern you have shown? For example, how should I incorporate " [EXTERNAL] Fwd: " into this regex?
– Django0602, Commented Dec 25, 2022 at 14:31
Add the \[EXTERNAL]\s+Fwd: alternative, see regex101.com/r/LcWZdr/2
– Wiktor Stribiżew, Commented Dec 25, 2022 at 14:44
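
The extra alternative suggested in the last comment could be added roughly as follows. This is a minimal sketch using only the standard `re` module; the name `PREFIX` and the fifth sample subject are made up for illustration and do not come from the question.

```python
import re

# The alternation from the comment above, with one more branch for "[EXTERNAL] Fwd:".
# PREFIX is a hypothetical name chosen for this sketch.
PREFIX = re.compile(
    r"^(?:EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)?"  # EXT || with optional RE:/FW: part
    r"|SV:\s*\[EXTERNAL]"                               # SV: [EXTERNAL]
    r"|\[EXTERNAL]\s+Fwd:)"                             # [EXTERNAL] Fwd:  (new branch)
    r"\s*"                                              # trailing whitespace after the prefix
)

subjects = [
    "EXT || Transport enquiry",
    "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
    "EXT || FW: Model - Jan",
    "SV: [EXTERNAL] Calculations",
    "[EXTERNAL] Fwd: Budget",  # hypothetical subject, not from the question
]

cleaned = [PREFIX.sub("", s) for s in subjects]
for before, after in zip(subjects, cleaned):
    print(f"{before!r} -> {after!r}")
```

With pandas, the same pattern string (`PREFIX.pattern`) should be usable in `df['Subject'].str.replace(..., '', regex=True)`; each further prefix becomes one more `|` branch inside the outer group.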

Wiktor Stribiżew · Accepted Answer · 2022-12-25 14:43:01Z

1

You can use

pattern = r"""(?mx)  # MULTILINE and VERBOSE modes on
^                   # start of a line
(?:                 # non-capturing group start
   EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)? # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
 |                  # or
   SV:\s*\[EXTERNAL] # SV: [EXTERNAL]
)                   # non-capturing group end
\s*                 # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern, '', regex=True)

See the regex demo.

Since re.X (the (?x) inline flag) is used, literal spaces and # characters in the pattern must be escaped; alternatively, use \s* or \s+ to match zero or more, or one or more, whitespace characters.
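
To illustrate the point about verbose mode, here is a small sketch with the standard `re` module. The variable names are illustrative only; the subject string is one of the samples from the question.

```python
import re

subject = "EXT || FW: Model - Jan"

# In verbose mode, unescaped spaces in the pattern are ignored,
# so this pattern effectively looks for "EXT||FW:" and finds nothing.
unescaped = re.compile(r"(?x)^EXT \|\| FW:")

# Escaping each space keeps it as a literal part of the pattern.
escaped = re.compile(r"(?x)^EXT\ \|\|\ FW:\s*")

# \s* is usually easier to read and also tolerates missing or extra spaces.
flexible = re.compile(r"(?x)^EXT\s*\|\|\s*FW:\s*")

print(unescaped.search(subject))   # no match object
print(escaped.sub("", subject))
print(flexible.sub("", subject))
```

The `\s*` form is the one used in the pattern above, since it stays readable next to the inline comments.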


braml1 · Accepted Answer · 2022-12-25 14:48:38Z

1

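Remove the $ anchors from the first and last expressions (with $, the pattern only matches when nothing follows the prefix) and reorder the substitutions so that the more specific patterns run before the general ^EXT \|\| one. Like this: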

import pandas as pd
import re

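def clean_subject_prelim(text):
     # Most specific prefixes first, so the generic "EXT ||" does not pre-empt them
     text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
     text = re.sub(r'EXT \|\| FW:', '' , text)
     text = re.sub(r'^EXT \|\|' , '' , text)
     text = re.sub(r'^SV: \[EXTERNAL]' , '' , text)
     # Drop the space left behind after the removed prefix
     return text.strip()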

data = {"Subject": [
"EXT || Transport enquiry",
"EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
"EXT || FW: Model - Jan",
"SV: [EXTERNAL] Calculations"]}

df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(clean_subject_prelim)
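
The effect of the `$` anchor and of the substitution order can be checked on a single string. This is a small sketch with the standard `re` module; the variable names are illustrative only.

```python
import re

subject = "EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry"

# With '$', the pattern needs the string to end right after "EXT || ",
# so a subject with more text after the prefix is left untouched.
unchanged = re.sub(r"^EXT \|\| $", "", subject)
print(unchanged == subject)

# Generic prefix first: "EXT || " is removed, and the longer pattern
# (which starts with "EXT ||") can no longer match what remains.
generic_first = re.sub(r"^EXT \|\| ", "", subject)
generic_first = re.sub(r"EXT \|\| RE: EXTERNAL: RE:", "", generic_first)
print(repr(generic_first))

# Specific prefix first: the whole "EXT || RE: EXTERNAL: RE:" part is removed.
specific_first = re.sub(r"EXT \|\| RE: EXTERNAL: RE:", "", subject)
specific_first = re.sub(r"^EXT \|\|", "", specific_first)
print(repr(specific_first))
```

Note that the specific-first result still begins with the space that followed the prefix, so a final `.strip()` or a trailing `\s*` in each pattern is needed to get the exact output asked for in the question.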
