
I have a Pandas dataframe column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if found, and then loops through the dataframe, splitting the text at the found match. I then drop the now-unneeded 're_match' column.

While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing to use iterrows() or creating a new column. My question is: is there a more optimal way of removing strings that match multiple regex patterns from a column?

In my current use case the unwanted strings are always at the end of the text block, hence the use of split(...)[0]. However, it would be great if the unwanted strings could be removed from any point in the text.

Also, note that combining the regexes into one long single pattern is not preferable, as there are tens of patterns, which will change on a regular basis.

import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)
patterns = [
    '( regex1 \d+)',
    '((?: regex 2)? \d{1,2} )',
    '( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:

    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    # placeholder that never occurs in the text, so split() leaves the row unchanged
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')

    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)
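For reference, the closest I have come to a vectorised version is chaining `Series.str.replace` once per pattern (a sketch with placeholder data and a placeholder pattern rather than my real ones, and assuming removal from any point in the text is acceptable):

```python
import pandas as pd

# placeholder data and pattern for illustration only
df = pd.DataFrame({'text': ['keep this regex1 42', 'nothing to remove']})
patterns = [r'( regex1 \d+)']

for p in patterns:
    # str.replace is vectorised over the whole column; case=False mirrors
    # re.IGNORECASE, and regex=True treats p as a regular expression
    df['text'] = df['text'].str.replace(p, '', case=False, regex=True)
```

This still loops over the patterns, but the per-row work is done inside pandas rather than with iterrows(), and no temporary column is needed.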

Thank you for your help

  • I'm not familiar with pandas, but the problem here, as I understand it, might come from the data structure called a dataframe. A simple way to tackle this task might be to just use pure Python or sed. Commented Jul 28, 2016 at 11:57

1 Answer


There is indeed, and it is called df.applymap(some_function).
Consider the following example:

import re
import pandas as pd

df = pd.DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

def cleanitup(val):
    """ Multiply purely numeric string values by 10 """
    rx = re.compile(r'^\d+$')
    if rx.match(val):
        return int(val) * 10
    else:
        return val

# here is where the magic starts
df = df.applymap(cleanitup)

Obviously, this is a made-up example, but now every cell that previously contained only digits has been multiplied by 10, and every other value has been left untouched.
With this in mind, you can check and rearrange your values if necessary in the function cleanitup().
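Applied to the regex removal in the question, the same cell-by-cell idea could look like this (a sketch with a made-up pattern and data; re.sub strips each match wherever it occurs in the text, not just at the end):

```python
import re
import pandas as pd

# stand-ins for the question's patterns (the real ones are not shown here)
patterns = [r'( regex1 \d+)']

def strip_patterns(val):
    """Remove every matching pattern from a single cell, wherever it occurs."""
    for p in patterns:
        val = re.sub(p, '', val, flags=re.IGNORECASE)
    return val

df = pd.DataFrame({'text': ['keep this regex1 42', 'clean already']})
# per-column equivalent of applymap: apply the function to each cell
df['text'] = df['text'].apply(strip_patterns)
```

Since the patterns live in one list, updating them on a regular basis only means editing that list; the cleaning function itself stays the same.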
