Python - Remove URLs from text with regex

Question

I have URLs in a text that look like this:

<https://buy.itunes.apple.com/WebObjects/MZFinance.woa/wa/reportAProblem?p
=22000073760328&o=i>

I've used the following pattern to try and remove them:

re.sub(r'\<http.+?\>', '', plain, re.S)

But it doesn't catch them all. For example, this one isn't removed:

<http://ax.phobos.apple.com.edgesuite.net/email/images_shared/spacer_99999\r\n9.gif>

If you put r (raw string) before assigning the second string (r'<http://ax.phobos.apple.com.edgesuite.net/email/images_shared/spacer_99999\r\n9.gif>') or use a double backslash (\\) (<http://ax.phobos.apple.com.edgesuite.net/email/images_shared/spacer_99999\\r\\n9.gif>), it will work.
– 4d4c, Commented Mar 29, 2013 at 20:35
This is pretty odd. I played around with it for a bit and it does match: re.match(r'.', '\n', re.S) works, but re.sub(r'.', '', '\n', re.S) does not. So it seems to match, but the replacing part fails somehow... really not sure where or how though. It's as if re.S doesn't work for re.sub.
– Stjepan Bakrac, Commented Mar 29, 2013 at 20:39
Yeah, that's what happens to me. Some URLs are removed but others remain.
– 8vius, Commented Mar 29, 2013 at 20:41

yonili · Accepted Answer · 2013-03-29 20:40:36Z

7

Try it like this:

# re.DOTALL makes '.' match newlines too, so URLs split across lines are caught
p = re.compile(r'\<http.+?\>', re.DOTALL)
re.sub(p, '', plain)
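
As a quick check, here is a minimal, self-contained sketch of the compiled-pattern approach run against the two URLs from the question. The filler words in `plain` and the names `url_pattern` and `cleaned` are made up for illustration, and the sample assumes the URLs really do contain line breaks.

```python
import re

# Sample text built from the two URLs in the question; the filler words
# ("before", "middle", "after") are made up for this sketch.
plain = (
    "before "
    "<https://buy.itunes.apple.com/WebObjects/MZFinance.woa/wa/reportAProblem?p\n"
    "=22000073760328&o=i>"
    " middle "
    "<http://ax.phobos.apple.com.edgesuite.net/email/images_shared/spacer_99999\r\n9.gif>"
    " after"
)

# DOTALL is stored in the compiled pattern, so '.' also matches \r and \n.
url_pattern = re.compile(r'<http.+?>', re.DOTALL)

# Strings are immutable, so keep the return value of sub().
cleaned = url_pattern.sub('', plain)

print(repr(cleaned))  # 'before  middle  after'
```

Because the flag travels with the compiled pattern, there is no flags argument to misplace in the substitution call.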

4 Comments

8vius Over a year ago

This did it, thank you. Care to add an explanation as to why the precompiled pattern works?

yonili Over a year ago

Actually, after taking a look at the re.sub function, I think you missed that there is an additional argument (count) before the flags argument, so something like re.sub(r'\<http.+?\>', '', plain, flags=re.S) should also work.

Stjepan Bakrac Over a year ago

@8vius The flag is being passed incorrectly for some reason, although I really don't know why. This encodes the flag in the pattern itself. According to the docs, re.sub takes five arguments (pattern, repl, str, count, flags), the last two being optional. However, when I try to call it with 5 arguments, it tells me it expects 4. In Python 3 it works when I do re.sub(r'.', '', '\n', 0, re.S), as well as re.sub(r'.', '', '\n', flags=re.S), neither of which works for me in Python 2, despite what the docs for it say.

8vius Over a year ago

Yes, setting the flags explicitly works as well, thanks. That's why it works with the precompiled then. Thank you both.
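
To make the diagnosis from the comments concrete, here is a small sketch of what the original call amounts to in Python 3. The sample string `text` and the variable names are made up; the point is only that re.S in the fourth positional slot is read as count, not as a flag.

```python
import re

# re.S is an alias of re.DOTALL, and its integer value is 16.
assert re.S == re.DOTALL
assert int(re.S) == 16

text = "a\nb"  # made-up sample

# The original call put re.S in the fourth positional slot, which is count.
# Spelled out with keywords, it amounts to: no DOTALL, at most 16 replacements.
as_count = re.sub(r'.', '', text, count=int(re.S))

# Passing the flag by keyword is what was intended.
as_flags = re.sub(r'.', '', text, flags=re.S)

# re.match has no count parameter, so its third positional argument is flags.
matches_newline = re.match(r'.', '\n', re.S) is not None

print(repr(as_count))   # '\n'
print(repr(as_flags))   # ''
print(matches_newline)  # True
```

This would also explain why only some URLs were removed: those without a line break matched even without DOTALL, and only up to 16 of them per call.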
