
I am trying to build a pipeline for Twitter sentiment analysis. As usual, data preprocessing is the hard part...

Based on real tweets, I made a dataframe with only 3 rows/tweets for experimentation.

What I am trying to do: 1: strip all @, ', http, etc. from the tweet. 2: once that is done, replace the old tweet with the cleaned one.

This only works partially: just a fragment of some tweets ends up back in my dataframe. The code does clean the tweets, but it only writes part of the original text back. I think the problem is somewhere in the conversion of the tweet from string to list, but after many hours of trying I am unable to fix it.

The dataframe looks like this (only an index and one column, Tweet); the tweets are of type string:

Index   Tweet
0       @justanamehere and a sentence here and a link http://www.test.com
1       @Personsname are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her .. @company1 @company2 #RETWEET https://x.something"
2      @companyx @companyex1 @company3 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword https://x.Something

My code:

import re
import string

def strip_links(text):
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    row['Tweet'] = ' '.join(words)

    return ' '.join(words)


# The code below is needed because the text in the df is of type str. Convert it to a list.

for index, row in df_tweet.iterrows():
    tweet = list(row['Tweet'].split(","))

    for t in tweet:
        strip_all_entities(strip_links(t))
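For reference, a stripped-down, self-contained version of this loop (simplified `strip_links` regex, one made-up sample tweet) that applies both functions to the whole tweet and writes the result back with `df.at`, instead of assigning inside `strip_all_entities`:

```python
import re
import string

import pandas as pd

def strip_links(text):
    # Simplified stand-in for the question's link regex: drop any http(s) URL
    return re.sub(r'https?://\S+', ', ', text)

def strip_all_entities(text):
    # Replace punctuation (except the @/# markers) with spaces,
    # then drop any word that starts with @ or #
    entity_prefixes = ('@', '#')
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = [w for w in text.split() if w and not w.startswith(entity_prefixes)]
    return ' '.join(words)

# Hypothetical one-row frame mirroring tweet 0 from the question
df_tweet = pd.DataFrame({'Tweet': [
    '@justanamehere and a sentence here and a link http://www.test.com',
]})

# Clean the whole tweet in one go and write it back via df.at,
# so no comma-split chunk can overwrite the others
for index, row in df_tweet.iterrows():
    df_tweet.at[index, 'Tweet'] = strip_all_entities(strip_links(row['Tweet']))

# df_tweet.loc[0, 'Tweet'] is now 'and a sentence here and a link'
```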

Running this prints:

'and a sentence here and a link' 'are a fraud and farce' '' a lying person together with the fake media Something else Personname suppose you work with her' 'etc AS lot of bad words here It is a cancelculture' 'these rats want to badword'

But df_tweet shows only this:

    Tweet
0   and a sentence here and a link
1   a lying person together with the fake media So...
2   these rats want to badword

The expected result is:

index   Tweet
0       and a sentence here and a link
1       are a fraud and farce a lying person together with the fake media Something else Personname? suppose you work with her
2       AS lot of bad words here It is a cancelculture these rats want to badword

Thanks for helping me out!! Cheers Jan

  • what is the expected result ? Commented Aug 5, 2022 at 9:50
  • Good question. I've edited my question now. Thanks in advance. Commented Aug 5, 2022 at 9:55

2 Answers


try:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()

Output:

        Tweet
Index   
0       and a sentence here and a link
1       are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her
2       etc. AS lot of bad words here. It is a cancelculture, these rats want to badword
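To sanity-check the chain end-to-end, here is a self-contained run on a made-up one-row frame (the hyphen in the last character class is escaped so it is not read as a `+`-to-`=` range, which would also swallow digits):

```python
import pandas as pd

# Hypothetical single-tweet frame mirroring row 0 of the question
df = pd.DataFrame({'Tweet': [
    '@justanamehere and a sentence here and a link http://www.test.com',
]})

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r"\s[#@%/;$()~_?+\-=\\.&']+", '', regex=True)\
    .str.strip()

# df.Tweet[0] is now 'and a sentence here and a link'
```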

To delete only the non-western characters while keeping the tweets themselves:

df.Tweet = df.Tweet\
    .apply(lambda x: ''.join([i if i.isascii() else '' for i in x]))\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()

To delete tweets containing non-western characters:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()
df = df[df.Tweet.apply(lambda x: x.isascii())]
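A minimal standalone check of the row-dropping variant, on two made-up sample strings (note `str.isascii()` requires Python 3.7+):

```python
import pandas as pd

# Hypothetical two-row sample: one pure-ASCII tweet, one with Japanese characters
df = pd.DataFrame({'Tweet': ['plain ascii tweet', 'mixed 現在総合 tweet']})

# str.isascii() is True only when every character in the string is ASCII,
# so the filter keeps pure-ASCII rows and drops the rest
df = df[df.Tweet.apply(lambda x: x.isascii())]

# list(df.Tweet) is now ['plain ascii tweet']
```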

7 Comments

Thanks! Awesome! May I ask one more question? How do I remove non-western letters from tweets in the code you wrote for me, like: 現在総合病院産科の出産受入状況(アムステルダ (no idea what language this is). Thanks
You could also accept it as the answer to the question
Sorry, just accepted your answer!
Do you want to delete only the non-western characters, or the whole tweet with non-western characters?
I just edited the answer covering both cases

Found a solution for removing Chinese (and similar) characters:

df_tweet.Tweet = df_tweet.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.replace(r'[^\x00-\x7f]', "", regex=True )\
    .str.strip()
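The key line is the `[^\x00-\x7f]` replace; a tiny standalone demo on a made-up string (note the surrounding spaces stay, since only the non-ASCII characters themselves are removed):

```python
import pandas as pd

s = pd.Series(['abc 現在 def'])

# [^\x00-\x7f] matches any character outside the ASCII range,
# so each such character is replaced with the empty string
cleaned = s.str.replace(r'[^\x00-\x7f]', '', regex=True)

# cleaned[0] is now 'abc  def' (double space where the characters were)
```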

