I have a small dataframe and am trying to remove the URL from the end of each string in the Links column. The following code works for cells where the URL is on its own, but as soon as there is any text before the URL, the URL is not removed.

Here is the data: https://docs.google.com/spreadsheets/d/10LV8BHgofXKTwG-MqRraj0YWez-1vcwzzTJpRhdWgew/edit?usp=sharing (link to spreadsheet)

import pandas as pd  

df = pd.read_csv('TestData.csv')    

df['Links'] = df['Links'].replace(to_replace=r'^https?:\/\/.*[\r\n]*',value='',regex=True)

df.head()

Thanks!

  • Please do not use links to third-party sites. Include as much relevant data as necessary in your question. Also, include the expected results. Commented Aug 23, 2018 at 21:07
  • Just remove the ^, which anchors the match to the start of the string. That will fix your issue. Commented Aug 23, 2018 at 21:49
  • @Onyambu thanks that was all that was needed. Commented Aug 24, 2018 at 13:19
  • Does this answer your question? Remove a URL row by row from a large set of text in python panda dataframe Commented Sep 9, 2020 at 22:26
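
The effect of the ^ anchor described in the comments can be checked on plain strings with the standard re module. This is only a sketch: the sample strings and variable names are made up for illustration and do not come from the linked spreadsheet.

```python
import re

# Pattern from the question (anchored) and the same pattern without ^.
anchored = r'^https?://.*[\r\n]*'
unanchored = r'https?://.*[\r\n]*'

# Hypothetical sample values, not taken from the original data.
url_only = 'https://example.com/page'
text_then_url = 'random words https://example.com/page'

# With ^ the URL must start the string, so only the first case matches.
anchored_on_url_only = re.sub(anchored, '', url_only)
anchored_on_mixed = re.sub(anchored, '', text_then_url)

# Without ^ the URL is found wherever it starts.
unanchored_on_mixed = re.sub(unanchored, '', text_then_url)

print(repr(anchored_on_url_only))
print(repr(anchored_on_mixed))
print(repr(unanchored_on_mixed))
```

Note that the trailing .* still removes everything from the URL to the end of the line, which is fine when the URL is the last thing in the cell.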

3 Answers

Try this:

import re
# Keep only the text before the first URL (http or https)
df['cleanLinks'] = df['Links'].apply(lambda x: re.split(r'https?://.*', str(x))[0])

Output:

df['cleanLinks']

    cleanLinks
0   random words to see if it works now 
1   more stuff that doesn't mean anything 
2   one last try please work 

1 Comment

I came across this thread today and tried out this solution. For me, this just keeps the string before the URL, but deletes everything after. So if you have a cell with an URL in the middle, this does not work.
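
The behaviour described in this comment can be reproduced on a single string. The following is a minimal sketch with a made-up sample sentence; swapping in re.sub with \S+ is one possible alternative, not something the answer above proposes.

```python
import re

# Hypothetical sample with a URL in the middle of the text.
text = 'before https://example.com/a after'

# Splitting on a pattern that ends in .* discards the URL and the rest of the line.
kept_by_split = re.split(r'https?://.*', text)[0]

# Substituting only the non-whitespace run that forms the URL keeps the text after it.
kept_by_sub = re.sub(r'https?://\S+', '', text)

print(repr(kept_by_split))
print(repr(kept_by_sub))
```

The substitution leaves a double space where the URL was, so a follow-up whitespace cleanup may be wanted.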

Try a cleaner regex:

df['example'] = df['example'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Before using a regex in pandas .replace(), or anywhere else for that matter, test the pattern with re.sub() on a single simple string. When faced with a big problem, break it down into smaller ones.
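
As a rough illustration of that workflow, the pattern can be tried on a few hand-written strings before it goes anywhere near the dataframe. The helper name strip_urls, the sample strings, and the whitespace cleanup are assumptions added for this sketch.

```python
import re

# Pattern under test: anything starting with http or www. up to the next whitespace.
URL_PATTERN = r'http\S+|www\.\S+'

def strip_urls(text):
    """Remove URL-like tokens, then collapse the whitespace left behind."""
    without_urls = re.sub(URL_PATTERN, '', text)
    return ' '.join(without_urls.split())

# Hypothetical test strings covering a URL in the middle, a www link, and no link.
samples = [
    'see https://example.com/page for details',
    'visit www.example.org today',
    'no links here',
]

for sample in samples:
    print(repr(strip_urls(sample)))
```

Once the output looks right on these small cases, the same pattern can be passed to the pandas call with regex=True.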

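Alternatively, use the str.replace method. Pass regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default:

df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)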

1 Comment

The accepted answer by OP does not work for me (deletes everything after the URL as well). Your suggestion works better, removing only URLs and keeping the rest.

For a DataFrame df, URLs can be removed with a more specific regex as follows:

import pandas as pd

df = pd.read_csv('./data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Replace each URL in the text column with a single space
    dataframe['text'] = dataframe['text'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', regex=True)

clean_data(df)
print(df['text'])
