I'm looking for a regex to remove every URL or domain name from a string, so that:

string='this is my content domain.com more content http://domain2.org/content and more content domain.net/page'

becomes

'this is my content more content and more content'

Removing the most common TLDs is enough for me, so I tried:

string = re.sub(r'\w+(.net|.com|.org|.info|.edu|.gov|.uk|.de|.ca|.jp|.fr|.au|.us|.ru|.ch|.it|.nel|.se|.no|.es|.mil)\s?','',string)

But this removes too much, not only URLs. What would be the correct syntax?

    Sure, . matches any char. Commented Feb 26, 2019 at 14:05
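
To make the comment above concrete, here is a small sketch of how an unescaped dot over-matches. The sample text and the shortened two-TLD patterns are made up for illustration and are not the asker's full pattern.

```python
import re

# Unescaped: "." matches ANY character, so ".us" also matches "aus" in "because".
unescaped = r'\w+(.com|.us)'
# Escaped: "\." matches only a literal dot.
escaped = r'\w+(\.com|\.us)'

text = 'because welcome domain.com'  # made-up sample input

broken = re.sub(unescaped, '', text)
fixed = re.sub(escaped, '', text)

print(repr(broken))  # ordinary words get eaten too
print(repr(fixed))   # only the domain is removed
```

With the unescaped version, ordinary words such as "because" and "welcome" are partly removed, which is the "too much stuff" from the question.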

2 Answers

You should escape all those dots or, better yet, move the dot outside the group and escape it once. You could also match from the first non-whitespace character through to the last one, so the whole URL is removed, like this:

re.sub(r'\S+\.(net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil)\S*\s?', '', string)

With this pattern, the following input:
'this is my content domain.com more content http://domain2.org/content and more content domain.net/page thingynet stuffocom'
becomes:

'this is my content more content and more content thingynet stuffocom'
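
One caveat with this pattern: nothing stops the TLD from being the start of a longer word, so a token such as config.network (a made-up example) would also be removed. A possible refinement, only a sketch, is a negative lookahead right after the TLD. The names TLDS and strip_urls are hypothetical, and the TLD list is shortened for brevity.

```python
import re

# Hypothetical, shortened TLD list; extend it as needed.
TLDS = ("net", "com", "org")

# \S+            everything before the dot (scheme, subdomains, name)
# \.(?:...)      a literal dot followed by one of the TLDs
# (?![A-Za-z0-9-])  the TLD must not continue into a longer word
# \S*\s?         the rest of the URL plus one trailing whitespace character
URL_RE = re.compile(
    r'\S+\.(?:' + '|'.join(TLDS) + r')(?![A-Za-z0-9-])\S*\s?'
)

def strip_urls(text):
    """Remove whitespace-delimited tokens that look like URLs or domains."""
    return URL_RE.sub('', text)

print(strip_urls('see domain.com and config.network now'))
```

This still relies on a hand-maintained TLD list, so it is a heuristic rather than a real URL parser.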

This is an alternative solution:

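import re

# Read the whole file; the with-block closes it automatically.
with open('test.txt', 'r') as f:
    content = f.read()

# Match any whitespace-delimited token that contains .com, .org, or .net.
pattern = r"\S*\.(com|org|net)\S*"
result = re.sub(pattern, '', content)
print(result)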

Input:

this is my content domain.com more content http://domain2.org/content and more content domain.net/page and https://www.foo.com/page.php

Output:

this is my content  more content  and more content  and
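
The doubled spaces in the output above appear because this pattern does not consume the whitespace around each match. One possible cleanup, sketched here with a hypothetical remove_urls helper and an inline string instead of test.txt, is to split on whitespace and re-join:

```python
import re

PATTERN = r"\S*\.(com|org|net)\S*"

def remove_urls(text):
    """Drop URL-like tokens, then collapse the leftover whitespace."""
    stripped = re.sub(PATTERN, '', text)
    # split() with no arguments splits on any run of whitespace and
    # discards leading/trailing whitespace, so join() yields single spaces.
    return ' '.join(stripped.split())

sample = ('this is my content domain.com more content '
          'http://domain2.org/content and more content domain.net/page')
print(remove_urls(sample))
```

Note that this also flattens newlines and tabs into single spaces, which may or may not be wanted for multi-line input.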
