Removing URL from a column in Pandas Dataframe

Question

I have a small dataframe and am trying to remove the URL from the end of the string in the Links column. The following code works on cells where the URL is on its own, but as soon as there is text before the URL, the URL is not removed.

Here is the data: https://docs.google.com/spreadsheets/d/10LV8BHgofXKTwG-MqRraj0YWez-1vcwzzTJpRhdWgew/edit?usp=sharing (link to spreadsheet)

import pandas as pd  

df = pd.read_csv('TestData.csv')    

df['Links'] = df['Links'].replace(to_replace=r'^https?:\/\/.*[\r\n]*',value='',regex=True)

df.head()

Thanks!

Please do not use links to third-party sites. Include as much relevant data as necessary in your question. Also, include the expected results.
– DYZ, Commented Aug 23, 2018 at 21:07
Just remove the ^, which anchors the match to the start of the string. That will fix your issue.
– Onyambu, Commented Aug 23, 2018 at 21:49
Does this answer your question? Remove a URL row by row from a large set of text in python panda dataframe
– Abu Shoeb, Commented Sep 9, 2020 at 22:26
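
To illustrate the comment about the ^ anchor, here is a minimal sketch. The two sample rows and the example.com URLs are made up for illustration (the real data is only in the linked spreadsheet), and the only change between the two patterns is the leading ^.

```python
import pandas as pd

# Hypothetical sample rows standing in for the spreadsheet data.
df = pd.DataFrame({
    "Links": [
        "https://example.com/a",
        "random words to see if it works now https://example.com/b",
    ]
})

anchored = r"^https?://.*[\r\n]*"   # original pattern: the URL must start the string
unanchored = r"https?://.*[\r\n]*"  # same pattern with the ^ anchor removed

with_anchor = df["Links"].replace(to_replace=anchored, value="", regex=True)
without_anchor = df["Links"].replace(to_replace=unanchored, value="", regex=True)

print(with_anchor.tolist())     # second row still contains its URL
print(without_anchor.tolist())  # URL removed from both rows
```

Note that the unanchored pattern still uses .*, so it removes everything from the URL to the end of the line, not just the URL itself.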

Vishnu Kunchur · Accepted Answer · 2018-08-23 21:28:00Z

8

Try this:

import re
# Keep only the text before the first URL (http or https)
df['cleanLinks'] = df['Links'].apply(lambda x: re.split(r'https?://.*', str(x))[0])

Output:

df['cleanLinks']

    cleanLinks
0   random words to see if it works now 
1   more stuff that doesn't mean anything 
2   one last try please work

edited Aug 23, 2018 at 21:28

answered Aug 23, 2018 at 21:21

1 Comment

PretendNotToSuck Over a year ago

I came across this thread today and tried out this solution. For me, this just keeps the string before the URL, but deletes everything after. So if you have a cell with a URL in the middle, this does not work.
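
The behaviour described in this comment can be reproduced on a single string. This is only a sketch: the sample text and URL are hypothetical, and the substitution pattern is one possible alternative rather than the answerer's method.

```python
import re

text = "before https://example.com/page after"  # hypothetical cell value

# Splitting at the URL and keeping the first piece drops everything
# from the URL to the end of the string.
kept_by_split = re.split(r"https?://.*", text)[0]

# Substituting only the run of non-whitespace characters that forms
# the URL leaves the text after it in place.
kept_by_sub = re.sub(r"https?://\S+", "", text)

print(repr(kept_by_split))
print(repr(kept_by_sub))
```

The substitution leaves a double space where the URL used to be, which may need a further cleanup step depending on the data.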

Philip DiSarro · Answer · 2018-08-23 21:28:22Z

8

Try a cleaner regex:

df['example'] = df['example'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Before using a regex in pandas .replace(), or anywhere else for that matter, test the pattern with re.sub() on a single basic string. When faced with a big problem, break it down into smaller ones.
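
As a sketch of that advice, the pattern can be tried with re.sub() on one string before it goes anywhere near a dataframe. The sample sentence is made up, and the combined pattern with an escaped dot after www is an assumption about what the two chained replacements are meant to cover.

```python
import re

# Combined pattern: anything starting with "http" or "www." up to the next whitespace.
pattern = r"http\S+|www\.\S+"

# Hypothetical test string containing one URL of each kind.
sample = "see www.example.com and http://example.com/x for details"

cleaned = re.sub(pattern, "", sample, flags=re.IGNORECASE)
print(repr(cleaned))
```

Once the result looks right on a few such strings, the same pattern can be passed to the pandas call.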

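Alternatively, we could use the str.replace method (regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default):

df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)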

edited Aug 23, 2018 at 21:28

answered Aug 23, 2018 at 21:14

1 Comment

PretendNotToSuck Over a year ago

The accepted answer by OP does not work for me (deletes everything after the URL as well). Your suggestion works better, removing only URLs and keeping the rest.

Isurie · Answer · 2021-01-21 09:59:25Z

1

For a DataFrame df, URLs can be removed using a regex as follows:

import pandas as pd

df = pd.read_csv('./data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Replace each URL in the 'text' column with a single space (modifies dataframe in place)
    dataframe['text'] = dataframe['text'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', regex=True)

clean_data(df)
print(df['text'])
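
A small sketch of how this pattern behaves on a single string, using a made-up sentence and URL. One detail worth knowing: inside the character class, $-_ is a range (from $ to _ in ASCII order), so it also covers characters such as /, ? and =, which is why the match continues through the URL path and query string.

```python
import re

pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Hypothetical text with a URL in the middle.
text = "read https://example.com/a/b?x=1 today"

# Same replacement as in clean_data: each URL becomes a single space.
cleaned = re.sub(pattern, " ", text)
print(repr(cleaned))

# The "$-_" part is a character range, so "/" falls inside it.
slash_in_range = re.fullmatch(r"[$-_]", "/") is not None
print(slash_in_range)
```

Because the URL is replaced with a space rather than an empty string, the surrounding spaces add up to three in a row here.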

answered Jan 21, 2021 at 9:59
