Removing strings that match multiple regex patterns from pandas series

Question

I have a pandas dataframe column containing text that needs to be cleaned of strings matching various regex patterns. My current attempt (given below) loops through each pattern, creates a new column containing the match if one is found, and then loops through the dataframe, splitting the text at the found match. I then drop the unneeded matching column 're_match'.

While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing iterrows() or a temporary column. Is there a better way of removing strings that match multiple regex patterns from a column?

In my current use case the unwanted strings are always at the end of the text block, hence the use of split(...)[0]. However, it would be great if the unwanted strings could be removed from any point in the text.

Also, note that combining the regexes into one long pattern is not preferable, as there are tens of patterns, and they will change on a regular basis.

import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)
patterns = [
    r'( regex1 \d+)',
    r'((?: regex 2)? \d{1,2} )',
    r'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:
    # Capture the first match of this pattern in each row (NaN if none)
    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    # Placeholder assumed never to occur in the text, so split() leaves it whole
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')

    # Keep only the text before the match
    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)

Thank you for your help

I'm not familiar with pandas, but as I understand it, the problem here might come from the dataframe data structure itself. A simple way around it might be to just use pure Python or sed.
– fronthem, commented Jul 28, 2016 at 11:57

Jan · Accepted Answer · 2016-07-28 12:08:56Z


There is indeed, and it is called df.applymap(some_function): it applies a function to every cell of the dataframe.
Consider the following example:

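import re
import pandas as pd

df = pd.DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

# Compile once rather than on every call
rx = re.compile(r'^\d+$')

def cleanitup(val):
    """Multiply digit-only strings by 10; return anything else unchanged."""
    if rx.match(val):
        return int(val) * 10
    return val

# here is where the magic starts: applymap calls cleanitup on every cell
# and returns a new DataFrame (pandas 2.1+ renames this to DataFrame.map)
df = df.applymap(cleanitup)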

Obviously, this is a made-up example, but every cell that contained only digits has now been multiplied by 10, and every other value has been left untouched.
With this in mind, you can check and rewrite your values as necessary inside the function cleanitup().
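To tie this back to the question, here is a minimal sketch of such a cleanup function. It keeps the patterns as a separate list and removes matches from anywhere in the text. The patterns and sample data are hypothetical placeholders, and it assumes the column holds only strings (no missing values). The second variant uses Series.str.replace, which avoids both iterrows() and the temporary column.

```python
import re
import pandas as pd

# Hypothetical patterns and data, for illustration only
patterns = [
    r"\s*ref \d+",
    r"\s*page \d{1,2}",
]
df = pd.DataFrame(
    {"text": ["keep this ref 123", "keep that page 7 please", "nothing to remove"]}
)

# Compile once, outside the function, so it is not repeated for every cell
compiled = [re.compile(p, flags=re.IGNORECASE) for p in patterns]

def strip_patterns(text):
    """Remove every match of every pattern from a single string."""
    for rx in compiled:
        text = rx.sub("", text)
    return text

# Variant 1: call the function on each element of the column
cleaned_apply = df["text"].apply(strip_patterns)

# Variant 2: one str.replace pass over the whole column per pattern
cleaned_replace = df["text"]
for p in patterns:
    cleaned_replace = cleaned_replace.str.replace(
        p, "", flags=re.IGNORECASE, regex=True
    )
```

Both variants still loop over the list of patterns, but not over the rows, so adding or changing a pattern only means editing the list.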

