Python regex to remove urls and domain names in string

Question

I'm looking for a regex to remove every URL or domain name from a string, so that:

string='this is my content domain.com more content http://domain2.org/content and more content domain.net/page'

becomes

'this is my content more content and more content'

Removing the most common TLDs is enough for me, so I tried

string = re.sub(r'\w+(.net|.com|.org|.info|.edu|.gov|.uk|.de|.ca|.jp|.fr|.au|.us|.ru|.ch|.it|.nel|.se|.no|.es|.mil)\s?','',string)

but this removes too much, not only URLs. What would be the correct syntax?

Sure, . matches any char.

– Wiktor Stribiżew, Commented Feb 26, 2019 at 14:05

Nullman · Accepted Answer · 2019-02-26 14:13:42Z

4

You should escape all those dots or, better yet, move the dot outside the group and escape it once. You could also match from the first non-space character through to the last non-space character, like this:

re.sub(r'[\S]+\.(net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil)[\S]*\s?','',string)

the following:
'this is my content domain.com more content http://domain2.org/content and more content domain.net/page thingynet stuffocom'
becomes:

'this is my content more content and more content thingynet stuffocom'
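As a minimal sketch of why the escape matters, the snippet below compares a shortened version of the question's pattern (unescaped dots) with the pattern from this answer. The variable names are only illustrative, and the shortened TLD list in the first pattern is an assumption made to keep the example small.

```python
import re

# TLD list copied from the answer's pattern.
TLDS = "net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil"

# Unescaped dots, as in the question: "." matches ANY character,
# so ".net" also matches "ynet" and ".com" also matches "ocom".
unescaped = re.compile(r"\w+(.net|.com|.org)\s?")

# Escaped dot moved outside the group, as in the answer.
escaped = re.compile(r"\S+\.(" + TLDS + r")\S*\s?")

text = ("this is my content domain.com more content "
        "http://domain2.org/content and more content "
        "domain.net/page thingynet stuffocom")

# Plain words with no dot at all are wrongly removed by the unescaped pattern.
mangled = unescaped.sub("", "thingynet stuffocom")
# The escaped pattern only removes tokens containing a literal ".tld".
cleaned = escaped.sub("", text)

print(repr(mangled))
print(cleaned)
```

Note that the trailing \s? consumes the space after each removed token, which is why no double spaces are left behind.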


Ceylan B. · Accepted Answer · 2019-02-26 14:40:43Z

1

This is an alternative solution:

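import re

# Read the whole file; the context manager closes it afterwards.
with open('test.txt', 'r') as f:
    content = f.read()

# Match any run of non-whitespace containing .com, .org, or .net.
pattern = r"[^\s]*\.(com|org|net)\S*"
result = re.sub(pattern, '', content)
print(result)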

Input:

this is my content domain.com more content http://domain2.org/content and more content domain.net/page' and https://www.foo.com/page.php

Output:

this is my content  more content  and more content  and
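The output above keeps double spaces because the pattern does not consume the whitespace around each removed token. One possible way to tidy that up, sketched here with a hypothetical helper named strip_domains (not part of the original answer), is to split on whitespace and re-join after the substitution:

```python
import re

# Same idea as the answer's pattern, with a non-capturing group.
pattern = re.compile(r"\S*\.(?:com|org|net)\S*")

def strip_domains(text):
    """Remove URL-like tokens, then collapse leftover whitespace."""
    without_urls = pattern.sub("", text)
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(without_urls.split())

sample = ("this is my content domain.com more content "
          "http://domain2.org/content and more content "
          "domain.net/page' and https://www.foo.com/page.php")

cleaned = strip_domains(sample)
print(cleaned)
```

This also normalizes tabs and newlines to single spaces, which may or may not be what you want for multi-line input.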
