1

I have the following column which consists of email subject headers:

Subject
EXT || Transport enquiry
EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry
EXT || FW: Model - Jan
SV: [EXTERNAL] Calculations

What I want to achieve is:

Subject
Transport enquiry
0001 || Copy of enquiry
Model - Jan
Calculations

and for this I am using the below code which only takes into account the first regular expression that I am passing and ignoring the rest

def clean_subject_prelim(text):
     text = re.sub(r'^EXT \|\| $' , '' , text)
     text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
     text = re.sub(r'EXT \|\| FW:', '' , text)
     text = re.sub(r'^SV: \[EXTERNAL]$' , '' , text)
     return text
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))

Why this is not working, what am I missing here?

3
  • 1
    Try df['subject_clean'] = df['Subject'].str.replace(r'(?m)^(?:EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)?|SV:\s*\[EXTERNAL])\s*', '', regex=True) Commented Dec 25, 2022 at 14:28
  • Worked really well but can you explain how you came to this solution? I have some other patterns as well that I would like to replace with '' so how should I incorporate the same within the pattern that you have shown? For ex: how should I incorporate " [EXTERNAL] Fwd: " into this regex? Commented Dec 25, 2022 at 14:31
  • Add the \[EXTERNAL]\s+Fwd: alternative, see regex101.com/r/LcWZdr/2 Commented Dec 25, 2022 at 14:44

2 Answers 2

1

You can use

pattern = r"""(?mx)  # MULTILINE mode on
^                   # start of string
(?:                 # non-capturing group start
   EXT\s*\|\|\s*(?:RE:\s*EXTERNAL:\s*RE:|FW:)? # EXT || or EXT || RE: EXTERNAL: RE: or EXT || FW:
 |                  # or
   SV:\s*\[EXTERNAL]# SV: [EXTERNAL]
)                   # non-capturing group end
\s*                 # zero or more whitespaces
"""
df['subject_clean'] = df['Subject'].str.replace(pattern', '', regex=True)

See the regex demo.

Since the re.X ((?x)) is used, you should escape literal spaces and # chars, or just use \s* or \s+ to match zero/one or more whitespaces.

Sign up to request clarification or add additional context in comments.

Comments

1

Get rid of the $ sign in the first expression and switch some of regex expressions from place. Like this:

import pandas as pd
import re

def clean_subject_prelim(text):
     text = re.sub(r'EXT \|\| RE: EXTERNAL: RE:', '' , text)
     text = re.sub(r'EXT \|\| FW:', '' , text)
     text = re.sub(r'^EXT \|\|' , '' , text)
     text = re.sub(r'^SV: \[EXTERNAL]' , '' , text)
     return text

data = {"Subject": [
"EXT || Transport enquiry",
"EXT || RE: EXTERNAL: RE: 0001 || Copy of enquiry",
"EXT || FW: Model - Jan",
"SV: [EXTERNAL] Calculations"]}

df = pd.DataFrame(data)
df['subject_clean'] = df['Subject'].apply(lambda x: clean_subject_prelim(x))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.