Using regular expression to filter out pandas data frames

Question

I have a data frame that looks like the following:

I want to remove every word that appears in a given list, e.g. ['King', 'sEAttle', 'California']. Here is my code:

import pandas as pd
import re 

remove_words = ['King', 'sEAttle', 'California']

remove_words_lower = (map(lambda x: x.lower(), remove_words))
pattern = '|'.join(remove_words_lower)

t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})


clean_tweets = []
for i, tweet in enumerate(df.tweets):
    tweet = tweet.lower()
    clean_tweet = re.sub(pattern, "", tweet)
    clean_tweets.append(clean_tweet)

df['clean_tweets'] = clean_tweets
df

Here is the result:

Is there a way to modify the regex so that the leftover fragments (@county, city, and #) are removed as well? In other words, remove the whole word if it contains any word from the given list. The pattern has to be as generic as possible (i.e. I can't hard-code @county to have it removed).

Expected output:

AloneTogether · Accepted Answer · 2022-02-03 08:49:07Z


I am not a regex expert, but you could match each remove word together with everything up to the next space (and back to the previous space, in case the remove word appears at the end of a word rather than the beginning), and also match # and @ if they are present:

import pandas as pd
import re 

remove_words = ['King', 'sEAttle', 'California']

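# Wrap each word in an optional group that spans the whole whitespace-delimited token,
# including a leading # or @. Raw strings keep \s from being read as a string escape.
remove_words_lower = map(lambda x: r'((#|@)?[^\s]*' + x.lower() + r'[^\s]*)?', remove_words)
pattern = ''.join(remove_words_lower)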
t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})

df['clean_tweets'] = df.tweets.map(lambda x : re.sub(pattern, "", x.lower()).strip())

      Id                                   tweets clean_tweets
0  user1  Hello! @kingcounty Seattle, #California       hello!
1  user2                 hello! seattlecity #king       hello!

Or, when the remove word appears at the end of a word:

      Id                                   tweets clean_tweets
0  user1  Hello! @countyking Seattle, #California       hello!
1  user2                 hello! cityseattle #king       hello!
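
The same idea can also be written as a single alternation wrapped in \S* on both sides. The following is only a sketch under a few assumptions: the helper names build_pattern and remove_tokens are made up for illustration, and the use of re.escape and the collapsing of leftover spaces are additions that are not part of the answer above.

```python
import re

def build_pattern(words):
    """Compile a regex that matches any whole token containing one of `words`."""
    # re.escape keeps characters such as '.' or '+' in a word literal.
    alternation = '|'.join(re.escape(w.lower()) for w in words)
    # \S* on either side widens the match to the full whitespace-delimited token,
    # so a leading '#' or '@' and trailing punctuation are removed with it.
    return re.compile(rf'\S*(?:{alternation})\S*')

def remove_tokens(text, pattern):
    """Lowercase `text`, drop matching tokens, and collapse leftover spaces."""
    cleaned = pattern.sub('', text.lower())
    return ' '.join(cleaned.split())

remove_words = ['King', 'sEAttle', 'California']
pattern = build_pattern(remove_words)

tweets = [
    'Hello! @kingcounty Seattle, #California',
    'hello! cityseattle #king',
    'The king visited california today',
]
cleaned = [remove_tokens(t, pattern) for t in tweets]
```

With the DataFrame from the answer, something like df['clean_tweets'] = df['tweets'].map(lambda t: remove_tokens(t, pattern)) should apply the same cleaning to each row.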

edited Feb 3, 2022 at 8:49

answered Feb 3, 2022 at 8:41

AloneTogether


2 Comments

Michael W Over a year ago

Brilliant! I do have a question. [^\s]* means match anything that's not whitespace, is this correct? Is it equivalent to \S* with a capital S?

AloneTogether Over a year ago

Yes, exactly: match everything until the next whitespace appears.
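
A quick way to check the equivalence discussed in these comments is to run both spellings over a few strings and compare the matches. This is a small sketch using Python's re module; the sample strings and the helper name same_matches are made up for illustration.

```python
import re

def same_matches(text):
    """Return True if [^\\s]* and \\S* find identical matches in `text`."""
    negated_class = re.findall(r'[^\s]*', text)
    shorthand = re.findall(r'\S*', text)
    return negated_class == shorthand

samples = [
    'hello! @kingcounty seattle, #california',
    'tab\tseparated\nlines',
    '',
]
results = [same_matches(s) for s in samples]
```

Both patterns stop at the same places, so \S* can be used as a shorter spelling of [^\s]* in the answer's pattern.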
