I have a data frame with an Id column and a tweets column; the code below builds it.

I want to remove every word that appears in a list, e.g. ['King', 'sEAttle', 'California'], ignoring case. Here is my code:

import pandas as pd
import re

remove_words = ['King', 'sEAttle', 'California']

# Lowercase the words and join them into one alternation: 'king|seattle|california'
remove_words_lower = [word.lower() for word in remove_words]
pattern = '|'.join(remove_words_lower)

t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})

# Lowercase each tweet, then delete every occurrence of a listed word
clean_tweets = []
for tweet in df.tweets:
    clean_tweet = re.sub(pattern, "", tweet.lower())
    clean_tweets.append(clean_tweet)

df['clean_tweets'] = clean_tweets
df

The result still contains fragments of the matched words: the first tweet becomes 'hello! @county , #' and the second becomes 'hello! city #'.

Is there a way to modify the regex so that the leftover fragments (@county, city, and #) are removed as well? In other words, I want to remove the whole word if it contains a word from the given list. The pattern has to be as generic as possible (i.e., I can't hard-code @county to have it removed).

Expected output: every word that contains a listed word is removed entirely, so no fragments such as @county, city, or # remain.

1 Answer

I am not a regex expert, but you could extend the match around each remove word to the nearest whitespace on both sides (so it also works when the remove word sits at the end of a longer word rather than the beginning), and also match a leading # or @ if one is present:

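import pandas as pd
import re

remove_words = ['King', 'sEAttle', 'California']

# For each word, build an optional group: an optional # or @, then any
# non-whitespace characters on either side of the lowercased word.
# Raw strings keep \s from being read as a string escape sequence.
remove_words_lower = [
    r'((#|@)?[^\s]*' + re.escape(x.lower()) + r'[^\s]*)?'
    for x in remove_words
]
pattern = ''.join(remove_words_lower)
t1 = 'Hello! @kingcounty Seattle, #California'
t2 = 'hello! seattlecity #king'
df = pd.DataFrame({'Id': ['user1', 'user2'], 'tweets': [t1, t2]})

# Lowercase each tweet, remove the matched words, and trim the leftover spaces
df['clean_tweets'] = df.tweets.map(lambda x: re.sub(pattern, "", x.lower()).strip())

Output:

      Id                                   tweets clean_tweets
0  user1  Hello! @kingcounty Seattle, #California       hello!
1  user2                 hello! seattlecity #king       hello!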

Or, with the remove word at the end of the longer word (@countyking, cityseattle):

     Id                                   tweets clean_tweets
0  user1  Hello! @countyking Seattle, #California       hello!
1  user2                 hello! cityseattle #king       hello!
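
The same idea can be written more compactly. What follows is a minimal sketch, not part of the original answer: it assumes the remove words contain no whitespace, escapes them with re.escape, joins them into a single alternation, and surrounds that with \S* so the match grows to the whole whitespace-delimited word (the # or @ prefix is covered by \S* as well). The helper name remove_tokens is hypothetical. It also collapses the runs of spaces left behind, which matters when a removed word sits in the middle of a tweet.

```python
import re

remove_words = ['King', 'sEAttle', 'California']

# One alternation of the escaped, lowercased words, surrounded by \S* so the
# match extends to the whole whitespace-delimited word (including # or @).
alternation = '|'.join(re.escape(w.lower()) for w in remove_words)
pattern = re.compile(r'\S*(?:' + alternation + r')\S*')


def remove_tokens(text):
    """Hypothetical helper: lowercase text, drop words containing a remove word."""
    cleaned = pattern.sub('', text.lower())
    # Collapse the whitespace runs left behind by the removed words.
    return ' '.join(cleaned.split())


print(remove_tokens('Hello! @kingcounty Seattle, #California'))  # hello!
print(remove_tokens('hello! cityseattle #king today'))           # hello! today
```

On the data frame from the answer, this could be applied with df.tweets.map(remove_tokens).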

2 Comments

Brilliant! I do have a question: [^\s]* means "match anything that's not whitespace", is that correct? Is it equivalent to \S* with a capital S?
Yes, exactly. It matches everything until the next whitespace character appears.
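
A quick way to check the equivalence discussed in these comments is to run both forms over a few strings. This is only an illustrative sketch; the sample strings are made up for the check.

```python
import re

# Made-up sample strings, including tabs, newlines, and the empty string.
samples = [
    'hello! @kingcounty seattle, #california',
    'tabs\tand\nnewlines  mixed',
    '',
]

for s in samples:
    # Both spellings should find the same runs of non-whitespace characters.
    assert re.findall(r'[^\s]+', s) == re.findall(r'\S+', s)

print(re.findall(r'\S+', samples[0]))
# ['hello!', '@kingcounty', 'seattle,', '#california']
```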
