I have a Pandas dataframe column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if found, and then loops through the dataframe, splitting the column at the found match. I then drop the unneeded matching column 're_match'.
While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and creating a new column. My question is, is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are always at the end of the text block, hence, the use of split(...)[0]. However, it would be great if the unwanted strings could be extracted from any point in the text.
Also, note that combining the regexes into one long single pattern would be unpreferrable, as there are tens of patterns of which will change on a regular basis.
df = pd.read_csv('data.csv', index_col=0)
patterns = [
'( regex1 \d+)',
'((?: regex 2)? \d{1,2} )',
'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]
for p in patterns:
df['re_match'] = df['text'].str.extract(
pat=p, flags=re.IGNORECASE, expand=False
)
df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')
for index, row in df.iterrows():
df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]
df = df.drop('re_match', axis=1)
Thank you for your help
pandas, but the problem here as I understood might come from the data structure calleddataframe. The simple way to overcome this task might be just use a pure python or sed.