
I am trying to build a pipeline for Twitter sentiment analysis. As usual, data preprocessing is the hard part...

Based on real tweets, I made a dataframe with only 3 rows/tweets for experimentation.

What I am trying to do: 1: strip all @, ', http, etc. from the tweet. 2: once that is done, replace the old tweet with the cleaned one.

This only works partially: just a fragment of some tweets ends up back in my dataframe. The code does clean the tweets, but it only writes part of the original text back. I think the problem is somewhere in the conversion of the tweet from string to list, but after many hours of trying I am unable to fix it.

The dataframe looks like this (only an index and one column, Tweet); the tweets are of type string:

Index   Tweet
0       @justanamehere and a sentence here and a link http://www.test.com
1       @Personsname are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her .. @company1 @company2 #RETWEET https://x.something"
2      @companyx @companyex1 @company3 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword https://x.Something

My code:

import re
import string

def strip_links(text):
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    row['Tweet'] = ' '.join(words)

    return ' '.join(words)


# The code below is needed because the text in the df is of type str. Convert it to a list.

for index, row in df_tweet.iterrows():
    tweet = list(row['Tweet'].split(","))

    for t in tweet:
        strip_all_entities(strip_links(t))
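For reference, a stripped-down, self-contained version of this loop (simplified `strip_links` regex, one made-up sample tweet) that applies both functions to the whole tweet and writes the result back with `df.at`, instead of assigning inside `strip_all_entities`:

```python
import re
import string

import pandas as pd

def strip_links(text):
    # Simplified stand-in for the question's link regex: drop any http(s) URL
    return re.sub(r'https?://\S+', ', ', text)

def strip_all_entities(text):
    # Replace punctuation (except the @/# markers) with spaces,
    # then drop any word that starts with @ or #
    entity_prefixes = ('@', '#')
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = [w for w in text.split() if w and not w.startswith(entity_prefixes)]
    return ' '.join(words)

# Hypothetical one-row frame mirroring tweet 0 from the question
df_tweet = pd.DataFrame({'Tweet': [
    '@justanamehere and a sentence here and a link http://www.test.com',
]})

# Clean the whole tweet in one go and write it back via df.at,
# so no comma-split chunk can overwrite the others
for index, row in df_tweet.iterrows():
    df_tweet.at[index, 'Tweet'] = strip_all_entities(strip_links(row['Tweet']))

# df_tweet.loc[0, 'Tweet'] is now 'and a sentence here and a link'
```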

Running this prints:

'and a sentence here and a link' 'are a fraud and farce' '' a lying person together with the fake media Something else Personname suppose you work with her' 'etc AS lot of bad words here It is a cancelculture' 'these rats want to badword'

But df_tweet shows only this:

    Tweet
0   and a sentence here and a link
1   a lying person together with the fake media So...
2   these rats want to badword

The expected result is:

index   Tweet
0       and a sentence here and a link
1       are a fraud and farce a lying person together with the fake media Something else Personname? suppose you work with her
2       AS lot of bad words here It is a cancelculture these rats want to badword

Thanks for helping me out!! Cheers Jan

  • what is the expected result ? Commented Aug 5, 2022 at 9:50
  • Good question. I've edited my question now. Thanks in advance. Commented Aug 5, 2022 at 9:55

2 Answers


try:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()

Output:

        Tweet
Index   
0       and a sentence here and a link
1       are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her
2       etc. AS lot of bad words here. It is a cancelculture, these rats want to badword
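To sanity-check the chain end-to-end, here is a self-contained run on a made-up one-row frame (the hyphen in the last character class is escaped so it is not read as a `+`-to-`=` range, which would also swallow digits):

```python
import pandas as pd

# Hypothetical single-tweet frame mirroring row 0 of the question
df = pd.DataFrame({'Tweet': [
    '@justanamehere and a sentence here and a link http://www.test.com',
]})

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r"\s[#@%/;$()~_?+\-=\\.&']+", '', regex=True)\
    .str.strip()

# df.Tweet[0] is now 'and a sentence here and a link'
```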

To delete only the non-western characters while keeping the tweets themselves:

df.Tweet = df.Tweet\
    .apply(lambda x: ''.join([i if i.isascii() else '' for i in x]))\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()

To delete tweets containing non-western characters:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()
df = df[df.Tweet.apply(lambda x: x.isascii())]
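A minimal standalone check of the row-dropping variant, on two made-up sample strings (note `str.isascii()` requires Python 3.7+):

```python
import pandas as pd

# Hypothetical two-row sample: one pure-ASCII tweet, one with Japanese characters
df = pd.DataFrame({'Tweet': ['plain ascii tweet', 'mixed 現在総合 tweet']})

# str.isascii() is True only when every character in the string is ASCII,
# so the filter keeps pure-ASCII rows and drops the rest
df = df[df.Tweet.apply(lambda x: x.isascii())]

# list(df.Tweet) is now ['plain ascii tweet']
```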

7 Comments

Thanks! Awesome! May I ask one more question? How do I remove non-western letters from tweets in the code you wrote for me, like: 現在総合病院産科の出産受入状況(アムステルダ (no idea what language this is). Thanks
You could also accept it as the answer to the question
Sorry, just accepted your answer!
Do you want to delete only the non-western characters, or the whole tweet with non-western characters?
I just edited the answer covering both cases

Found a solution for removing Chinese (and similar) characters:

df_tweet.Tweet = df_tweet.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.replace(r'[^\x00-\x7f]', "", regex=True )\
    .str.strip()
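The key line is the `[^\x00-\x7f]` replace; a tiny standalone demo on a made-up string (note the surrounding spaces stay, since only the non-ASCII characters themselves are removed):

```python
import pandas as pd

s = pd.Series(['abc 現在 def'])

# [^\x00-\x7f] matches any character outside the ASCII range,
# so each such character is replaced with the empty string
cleaned = s.str.replace(r'[^\x00-\x7f]', '', regex=True)

# cleaned[0] is now 'abc  def' (double space where the characters were)
```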

