Remove all forms of URLs from a given string in Python

Question

I am new to Python and was wondering if there is a better way to match all forms of URLs that might be found in a given string. Upon googling, there seem to be a lot of solutions that extract domains, replace them with links, etc., but none that remove them from a string. I have included an example below for reference. Thanks!

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error Log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

You had better split your input text on spaces and check each piece, for instance with urlparse, to decide whether it is a URL you want removed. Beware, though, that this will parse any valid URI, and foo is a valid URI. So you might want to check that the URI is absolute, etc.
– fge, Commented Dec 29, 2012 at 11:09
This isn't a good idea. You won't catch everything, or if you do, you will decimate harmless text. Say, for example, I talk of PortableApps.com in the midst of a sentence; will you remove it? It will be understood as a URL by people, but you demolish the sentence if you remove it, because that is the name of the entity. Extend it one degree further; something like www.google.com will definitely be recognised by people as a web reference. Do you want to get rid of it? How about a different TLD? What if I talk about examp.le? "I'm examp.le on IRC." Et cetera. So: what is your actual goal?
– Chris Morgan, Commented Dec 29, 2012 at 11:16
(I know this doesn't deal with the actual file encoding issue; I am, in fact, trying to convince you to scrap the concept. I suppose that would remove the problem.)
– Chris Morgan, Commented Dec 29, 2012 at 11:27
So then, did you read python.org/peps/pep-0263.html as it told you to?
– Chris Morgan, Commented Dec 29, 2012 at 11:29
After editing it works for me. Also, you shouldn't name a variable str, because that shadows Python's built-in str type.
– doru, Commented Dec 29, 2012 at 11:47
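
As a rough illustration of fge's suggestion above (split on whitespace, then test each piece with a URL parser), here is a minimal sketch. It assumes Python 3, where the parser lives in urllib.parse (in Python 2 the module is called urlparse), and the helper names looks_like_absolute_url and drop_url_tokens are made up for this example. It only treats tokens with an http or https scheme and a host as URLs, so bare names such as www.domain.co.uk or foo are left alone.

```python
from urllib.parse import urlsplit


def looks_like_absolute_url(token):
    """Return True for tokens that parse as an absolute http(s) URL."""
    try:
        parts = urlsplit(token)
    except ValueError:
        # urlsplit can reject malformed input such as a broken IPv6 host
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def drop_url_tokens(text):
    """Split on whitespace and keep only the tokens that are not absolute URLs."""
    return " ".join(t for t in text.split() if not looks_like_absolute_url(t))


sample = "see http://www.google.com and www.domain.co.uk and foo"
print(drop_url_tokens(sample))
```

Note that this approach collapses runs of whitespace, and punctuation attached to a URL (for example a trailing full stop) is removed along with it.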

doru · Accepted Answer · 2012-12-29 12:35:44Z

There's an error in your code (in fact two):

1. You should put a backslash before the penultimate single quote to escape it:

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

2. You shouldn't use str as a variable name, because it shadows Python's built-in str type, so name it thestring or anything else.

For example:

import re

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

with the result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
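
A small sketch of why the backslash fix in point 1 works inside a raw string: the backslash stops the quote from ending the literal, but it also stays in the pattern, where the regex engine reads \' inside a character class as a plain single quote. The variable name quote_class and the sample text are made up for illustration.

```python
import re

# In a raw string, \' does not end the literal, and the backslash stays in the pattern.
quote_class = r'[\'"]'
print(len(quote_class))  # 5 characters: [ \ ' " ]

# Inside a character class, the regex engine treats \' as a literal single quote,
# so this removes both kinds of quote from the sample text.
print(re.sub(quote_class, '', 'it\'s "quoted"'))
```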

kerim · Accepted Answer · 2012-12-29 11:25:11Z

Include an encoding line at the top of your source file (the regex string contains non-ASCII symbols like »), e.g.:

# -*- coding: utf-8 -*-
import re
...

Also, surround your regex string with triple single (or double) quotes, ''' or """, instead of single quotes, as the string itself already contains quote symbols (' and ").

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
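
For reference, here is a minimal sketch of the triple-quote suggestion as a complete script. It assumes Python 3 (the question uses Python 2.7), where source files are read as UTF-8 by default, so the coding line is not needed there. The names URL_PATTERN and remove_urls are made up for this example, and the sample text is shortened from the question.

```python
import re

# The pattern from the question, wrapped in a triple-quoted raw string so the
# ' and " characters inside the character class need no escaping.
URL_PATTERN = re.compile(
    r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)


def remove_urls(text):
    """Delete every substring that the pattern recognises as a URL."""
    return URL_PATTERN.sub('', text)


text = 'for eg, http://www.google.com and www.domain.co.uk and etc.'
print(remove_urls(text))
```

The surrounding spaces are kept, so each removed URL leaves two adjacent spaces behind.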

3 Comments

Chris Morgan Over a year ago

All you need is # coding: utf-8. Unless you're doing -*- encoding: utf-8 -*- (note the en), there's no benefit to decorating it with the emacs -*- stuff.

kerim Over a year ago

Aha, that is very nice to know! At last)

Chris Morgan Over a year ago

Of course, refer to PEP 263 for information on it.
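
To see which forms of the declaration are picked up, the standard library's tokenize.detect_encoding can be used as a quick check. This is a sketch under the assumption of Python 3, whose detection follows PEP 263 but defaults to UTF-8 where Python 2 defaulted to ASCII. The helper name declared_encoding is made up, and latin-1 is used only so that a detected declaration is distinguishable from the default.

```python
import io
import tokenize


def declared_encoding(first_line):
    """Return the encoding Python detects from a source file's first line (bytes)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(first_line).readline)
    return encoding


# The plain form and both decorated forms from the comments above.
for line in (b"# coding: latin-1\n",
             b"# -*- coding: latin-1 -*-\n",
             b"# -*- encoding: latin-1 -*-\n"):
    print(line.decode("ascii").strip(), "->", declared_encoding(line))
```

All three lines are recognised as a declaration, and the name is reported in normalised form (latin-1 comes back as iso-8859-1).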
