83

I want to remove all URLs inside a string (replace them with ""). I searched around but couldn't find quite what I need.

Example:

text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/

I want the result to be:

text1
text2
text3
text4
text5
text6
3
  • 9
    Are you sure you've researched sufficiently? Have you tried regular expressions? Commented Jul 4, 2012 at 15:32
  • 2
    Yes but I didn't really understand how to do it in my example.. Commented Jul 4, 2012 at 15:34
  • 3
    Have you looked at stackoverflow.com/questions/520031/… Commented Jul 4, 2012 at 15:41

15 Answers

123

the shortest way

re.sub(r'http\S+', '', stringliteral)
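A minimal sketch of that one-liner applied to the example from the question. It assumes you also want to drop the lines left empty after the URLs are removed (which discards any blank lines the input already had); `strip_urls` and `sample` are hypothetical names used only for illustration.

```python
import re

# The example input from the question, one item per line.
sample = (
    "text1\ntext2\nhttp://url.com/bla1/blah1/\n"
    "text3\ntext4\nhttp://url.com/bla2/blah2/\n"
    "text5\ntext6\nhttp://url.com/bla3/blah3/"
)

def strip_urls(text):
    # Remove every run of non-whitespace characters that starts with "http".
    without_urls = re.sub(r'http\S+', '', text)
    # Assumption: lines that are empty after removal should be dropped too.
    return "\n".join(line for line in without_urls.splitlines() if line.strip())

print(strip_urls(sample))
```

As a comment below points out, `http\S+` also removes non-URL words that merely start with "http"; `r'https?://\S+'` is the stricter variant suggested there.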

8 Comments

This will also remove 'httpabc' and 'abchttp'.
@LouisYang huh? it shouldn't (and doesn't; at least on 3.7) remove abchttp. You'd have to use .*http or something like that. BTW, I'd suggest r'https?://\S+'.
this is the best solution and should be marked as the right answer
You can also write it like text = re.sub(r"\S*https?:\S*", "", text) to remove the URLs even if they're in parentheses or brackets.
@henley, above code did not work for my text: '''$0.29 non-gaap diluted income per share. [$29 million after tax] on the revaluation, http : //www.businesswire.com/news/home/20210217005928/en .-AAAAA-santa clara,"""
95

Python script:

import re
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)

Output:

text1
text2
text3
text4
text5
text6

Test this code here.

2 Comments

This solution assumes that any URL is immediately followed by a new line (which is the case in the OP's example, but just FYI). tolgayilmaz's regular expression doesn't have this potential shortcoming.
@FranckDernoncourt Interesting because this was not the case for the twitter dataset I am working with. Above code removed all urls despite them not being immediately followed by a new line
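To make the behaviour discussed in these comments concrete, here is a small sketch of what the `^` anchor with `re.MULTILINE` does. The input strings are made up for illustration. As far as the pattern itself goes, the constraint is that the URL must start a line, and anything after it on that line is removed as well.

```python
import re

pattern = r'^https?:\/\/.*[\r\n]*'

# URL at the start of a line: the rest of that line and the line break go too.
at_start = re.sub(pattern, '', "http://url.com/a trailing words\nnext",
                  flags=re.MULTILINE)

# URL in the middle of a line: "^" cannot match there, so nothing is removed.
mid_line = re.sub(pattern, '', "see http://url.com/a now",
                  flags=re.MULTILINE)

print(repr(at_start))
print(repr(mid_line))
```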
30

This worked for me:

import re
thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"

URLless_string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', thestring)
print(URLless_string)

Result:

text1
text2

text3
text4

text5
text6


19

Removal of HTTP links/URLs mixed up in any text:

import re
re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)

3 Comments

This method hangs for me when parsing a string with '[<link>](<link>)'. Any idea why?
above method worked for me. I think this is most comprehensive solution.
just a stern warning, i attempted to use this exact regex to remove URLS from a ve...eeery long text in ONE , just ONE record in swifter + pandas dataframe. and after waiting for hours it didn't seem to end... then I used this one stackoverflow.com/a/38498442/1465073, and the same text took like a fraction of a second to finish. I lost like 12-24 hours worth of work just trying to figure out what was happening. No errors, no warnings, just my apply function seemingly frozen for hours. Im using 13700K, 2 x 16 DDR5 6000, RTX 4090. The same issue manifested in Azure cloud A100 and V100s
18

This solution handles http, https, and the special characters that commonly appear in URLs:

import re
def remove_urls (vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return(vTEXT)


print( remove_urls("this is a test https://sdfs.sdfsdf.com/sdfsdf/sdfsdf/sd/sdfsdfs?bob=%20tree&jef=man lets see this too https://sdfsdf.fdf.com/sdf/f end"))

3 Comments

It doesn't work if the URL contains a hyphen, e.g. print(remove_urls("this https://sdfs-sdfsdf.com yo")) -> this -sdfsdf.com yo
Use this instead: r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%|\-)*\b'
I like the idea of this, but why is (http|https) optional? Do any URLs begin with ://? I have had decent success with (https|http|ftp):\/\/\S+
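The hyphen problem raised in the comments can be reproduced with a short sketch. `ORIGINAL` and `WITH_HYPHEN` are hypothetical names for the answer's pattern and the commenter's variant.

```python
import re

# Pattern from the answer, and the commenter's variant that also allows "-".
ORIGINAL = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
WITH_HYPHEN = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%|\-)*\b'

text = "this https://sdfs-sdfsdf.com yo"

# The original stops at the first hyphen and leaves the rest of the host behind.
print(repr(re.sub(ORIGINAL, '', text)))
# The variant consumes the whole URL (the surrounding spaces remain).
print(repr(re.sub(WITH_HYPHEN, '', text)))
```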
15

I wasn't able to find any that handled my particular situation, which was removing URLs in the middle of tweets where the URLs themselves contain whitespace, so I made my own:

(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*

here's an explanation:
(https?:\/\/) matches http:// or https://
(\s)* optional whitespaces
(www\.)? optionally matches www.
(\s)* optionally matches whitespaces
((\w|\s)+\.)* matches 0 or more of one or more word characters followed by a period
([\w\-\s]+\/)* matches 0 or more of one or more words(or a dash or a space) followed by '/'
([\w\-]+) any remaining path at the end of the url followed by an optional ending
((\?)?[\w\s]*=\s*[\w\%&]*)* matches ending query params (even with white spaces,etc)

test this out here: https://regex101.com/r/NmVGOo/8

3 Comments

Please edit your answer to include the explanation. Links can go dead.
@Gabriel, I have modified your code a little bit so that it works for both http and https: (?:(https|http)\s?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*
@tursunWali it already works for http and https, please see the attached testing link. Thank you
15

What you really want to do is to remove any string that starts with either http:// or https:// plus any combination of non white space characters. Here is how I would solve it. My solution is very similar to that of @tolgayilmaz

#Define the text from which you want to replace the url with "".
text ='''The link to this post is https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python'''

import re
#Either use:
re.sub(r'http://\S+|https://\S+', '', text)
#OR 
re.sub(r'http[s]?://\S+', '', text)

And the result of running either code above is

>>> 'The link to this post is '

I prefer the second one because it is more readable.
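One side effect of `\S+` is that it also consumes punctuation attached to the end of a URL, such as a closing parenthesis or a full stop. The following is only a sketch of one way to keep that punctuation, using a replacement function. `remove_urls_keep_punctuation` and the `TRAILING` character set are assumptions for illustration, and URLs that legitimately end in one of those characters would be cut short.

```python
import re

URL = re.compile(r'https?://\S+')
# Assumption: these characters, when they end a match, are sentence
# punctuation rather than part of the URL.
TRAILING = ".,;:!?)]}'\""

def remove_urls_keep_punctuation(text):
    def replace(match):
        url = match.group(0)
        stripped = url.rstrip(TRAILING)
        # Return only the trailing punctuation that was stripped off.
        return url[len(stripped):]
    return URL.sub(replace, text)

print(remove_urls_keep_punctuation("(see https://example.com/a)."))
```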

4 Comments

what if there is white space characters like : http : //www.businesswire.com/news/home/20210217005928/en .
Hmmm! will that still be a url?
if we want to remove such word groups from file, then what to do? (if I modify my question)
I think that would be a new and different question. Not the one being asked here. My answer is for the question that was posted here.
10

In order to remove any URL within a string in Python, you can use this RegEx function :

import re

def remove_URL(text):
    """Remove URLs from a text string"""
    return re.sub(r"http\S+", "", text)

1 Comment

This should be the best answer!
7

I know this has already been answered and it's stupid late, but I think this should be here. This is a regex that matches any kind of URL.

[^ ]+\.[^ ]+

It can be used like

re.sub(r'[^ ]+\.[^ ]+','',sentence)
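A quick sketch of what this pattern does on ordinary text may be useful before relying on it. It matches any space-free token with a dot in the middle, so it removes things that are not URLs as well; the sample sentence is made up for illustration.

```python
import re

sentence = "pi is 3.14 and visit example.com today"

# Both "3.14" and "example.com" are tokens with an inner dot, so both go.
cleaned = re.sub(r'[^ ]+\.[^ ]+', '', sentence)
print(repr(cleaned))
```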

5 Comments

This is only a regex, this does not replace anything and thus this isn't answering the question.
@AndréKool this is for matching any kind of url. for replacing there are already a lot of answers above
In that case i suggest you edit your answer to explain that to avoid any confusion.
This worked for me! Thanks! It is a very elegant and efficient solution to match url starting with and without http(s) and www.
indeed , very elegant and simple solution, covers ALL cases, thank you man
6

You could also look at it from the other way around...

from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse
[el for el in ['text1', 'FTP://somewhere.com', 'text2', 'http://blah.com:8080/foo/bar#header'] if not urlparse(el).scheme]
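The same idea can be applied to a whole string by splitting it on whitespace first. This is a minimal sketch, assuming Python 3 and that collapsing runs of whitespace into single spaces is acceptable. `drop_url_tokens` is a hypothetical helper name, and requiring a network location in addition to a scheme is an added assumption so that tokens like "note:" are kept.

```python
from urllib.parse import urlparse

def drop_url_tokens(text):
    kept = []
    for token in text.split():
        parsed = urlparse(token)
        # Treat a token as a URL only if it has both a scheme and a host part.
        if parsed.scheme and parsed.netloc:
            continue
        kept.append(token)
    # Note: joining with single spaces does not preserve the original spacing.
    return " ".join(kept)

print(drop_url_tokens("text1 FTP://somewhere.com text2 http://blah.com:8080/foo/bar#header"))
```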


3

The following regular expression in Python works well for detecting URL(s) in the text:

source_text = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6    '''

import re
url_reg  = r'[a-z]*[:.]+\S+'
result   = re.sub(url_reg, '', source_text)
print(result)

Output:

text1
text2

text3
text4

text5
text6

2 Comments

The question was answered 5 years ago. What new value does your answer bring?
This will delete lines like text1:text2, that is not wanted.
1

Why not use this? It's quite complete:

i = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)","",i)


0
import re
s = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/'''
g = re.findall(r'(text\d+)',s)
print ('list',g)
for i in g:
    print (i)

Out

list ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']
text1
text2
text3
text4
text5
text6

1 Comment

The text is just an example, not a keyword. It can be any sentence or word.
0

I think the most general URL regex pattern is this one:

URL_PATTERN = r'[A-Za-z0-9]+://[A-Za-z0-9%-_]+(/[A-Za-z0-9%-_])*(#|\\?)[A-Za-z0-9%-_&=]*'

There is a small module that does what you want:

pip install mysmallutils
from mysutils.text import remove_urls

remove_urls(text)


0

A simple lazy .*? with a positive lookahead should do the job.

import re

text="text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"

req=re.sub(r'http.*?(?=\s)', " ", text)
print(req)
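One caveat with the lookahead `(?=\s)`: a URL at the very end of the string has no whitespace after it, so it is not matched. A possible adjustment, offered as a sketch rather than as part of the original answer, is to let the lookahead also accept the end of the string. The sample input is made up for illustration.

```python
import re

ends_with_url = "see http://x.com/b"

# Original lookahead: needs whitespace after the URL, so nothing is removed.
unchanged = re.sub(r'http.*?(?=\s)', '', ends_with_url)

# Variant: whitespace OR end of string may follow the URL.
removed = re.sub(r'http.*?(?=\s|$)', '', ends_with_url)

print(repr(unchanged))
print(repr(removed))
```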

