6

I am new to Python and was wondering whether there is a better way to match all forms of URLs that might be found in a given string. From googling, there seem to be a lot of solutions that extract domains, replace them with links, etc., but none that removes them from a string. I have included an example below for reference. Thanks!

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|

(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error Log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
5 Comments
  • You had better split your input text by spaces and try and see if you want that URL removed using, for instance, urlparse. Beware though that this will parse any valid URI, and foo is a valid URI. So, you might want to check if the URI is absolute etc. Commented Dec 29, 2012 at 11:09
  • This isn't a good idea. You won't catch everything, or if you do, you will decimate harmless text. Say for example I talk of PortableApps.com in the midst of a sentence; will you remove it? It will be understood as a URL by people, but you demolish the sentence if you remove it, because that is the name of the entity. Extend it one degree further; something like www.google.com will definitely be recognised by people as a web reference. Do you want to get rid of it? How about a different TLD? If I talk about examp.le? "I'm examp.le on IRC." Et cetera. So: what is your actual goal? Commented Dec 29, 2012 at 11:16
  • (I know this doesn't deal with the actual file encoding issue; I am, in fact, trying to convince you to scrap the concept. I suppose that would remove the problem.) Commented Dec 29, 2012 at 11:27
  • So then, did you read python.org/peps/pep-0263.html as it told you to? Commented Dec 29, 2012 at 11:29
  • After editing it works for me. Also, you shouldn't name a variable str, because that shadows Python's built-in str type. Commented Dec 29, 2012 at 11:47
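The first comment's suggestion (split on whitespace, then test each token with urlparse) could look roughly like the sketch below. The function name strip_absolute_urls and the sample sentence are made up for illustration, and restricting the check to http/https URLs with a host is an assumption, not something the commenter specified.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def strip_absolute_urls(text):
    """Drop whitespace-separated tokens that parse as absolute http(s) URLs."""
    kept = []
    for token in text.split():
        try:
            parsed = urlparse(token)
        except ValueError:
            # urlparse rejects a few malformed inputs; keep those tokens as text.
            kept.append(token)
            continue
        if parsed.scheme in ('http', 'https') and parsed.netloc:
            continue  # absolute URL: skip it
        kept.append(token)
    # Note: joining with single spaces also collapses the original whitespace.
    return ' '.join(kept)


cleaned = strip_absolute_urls(
    'see http://www.google.com and www.domain.co.uk for details')
print(cleaned)
```

Scheme-less forms such as www.domain.co.uk are left in place by this check, which is the trade-off the second comment describes: the stricter the test, the fewer harmless words get removed.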

2 Answers

8

There are two errors in your code:

1. You should put a backslash before the penultimate single quote to escape it:

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

2. You shouldn't use str as a variable name, because it shadows Python's built-in str type; name it thestring or anything else.

For example:

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

with the result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
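On the first point, it may help to see why the backslash works inside a raw string. In r'...' the sequence \' does not end the literal, but the backslash stays in the string; the regex engine then reads \' as a literal quote. A minimal sketch, with quote_class and stripped as illustrative names:

```python
import re

# Raw string: the backslash is kept, so the pattern is the 5 characters [ \ ' " ]
quote_class = r'[\'"]'
print(len(quote_class))

# Inside a character class the regex engine treats \' as a plain single quote.
stripped = re.sub(quote_class, '', 'it\'s "quoted"')
print(stripped)
```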


7

Include an encoding declaration at the top of your source file (the regex string contains non-ASCII characters such as »), e.g.:

# -*- coding: utf-8 -*-
import re
...

Also wrap your regex string in triple quotes (''' or """) instead of single quotes, because the string itself already contains both quote characters (' and ").

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
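Putting both suggestions together, a complete script might look like the sketch below. The names URL_PATTERN, text and url_free and the shortened sample sentence are illustrative; the pattern is the one from the question, unchanged. The parenthesised print works in both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Triple-quoted raw string: the ' and " inside the character class need no escaping.
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

text = 'for eg, http://www.google.com and www.domain.co.uk and etc.'

# Replace every match with the empty string.
url_free = URL_PATTERN.sub('', text)
print('==' + url_free + '==')
```

Each removed URL leaves the spaces on either side of it behind, so the result contains doubled spaces where the URLs used to be.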

3 Comments

All you need is # coding: utf-8. Unless you're doing -*- encoding: utf-8 -*- (note the en), there's no benefit to decorating it with the emacs -*- stuff.
Aha, that is very nice to know! At last)
Of course, refer to PEP 263 for information on it.
