I have the following regex:

r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

When I apply this to a text string with, let's say, "this is www.website1.com and this is website2.com", I get:

['www.website1.com']

['website2.com']

How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...

2 Answers

Try this one (thanks @SunDeep for the update):

\s(?:www\.)?(\w+\.com)

Explanation

\s matches any whitespace character

(?:www\.)? optional non-capturing group, matches a literal www. zero or one time

(\w+\.com) capturing group, matches one or more word characters followed by a literal .com

And in action:

import re

s = 'this is www.website1.com and this is website2.com'

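# The dots are escaped so they match a literal "." rather than any character.
# findall returns the contents of the single capturing group for each match.
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)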

Output:

['website1.com', 'website2.com']
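One thing to watch for: because the pattern starts with \s, a domain at the very beginning of the string has no whitespace in front of it and is skipped. A minimal sketch of one possible workaround, assuming a negative lookbehind (?<!\S) is acceptable in place of \s (the variable names here are only illustrative):

```python
import re

s = 'www.website1.com and website2.com'

# \s needs an actual whitespace character before the domain,
# so the domain at the start of the string is not matched.
with_space = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(with_space)

# (?<!\S) means "not preceded by a non-whitespace character",
# which also holds at the very start of the string.
with_lookbehind = re.findall(r'(?<!\S)(?:www\.)?(\w+\.com)', s)
print(with_lookbehind)
```

The lookbehind consumes no characters, so the capturing group and the rest of the pattern stay the same.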

A couple of notes about this. First, matching every valid domain name is very difficult. I chose \w+ for this example, but \w also matches underscores and does not match hyphens, so a stricter pattern could be something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.

This answer has a lot of helpful info about matching domains: What is a regular expression which will match a valid domain name without a subdomain?
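As a rough sketch of how the stricter label pattern from the question could be combined with the optional www. prefix: the prefix sits in a non-capturing group outside the single capturing group, so re.findall returns only the part after it. The names LABEL and DOMAIN_RE are made up for this example, and the pattern is not a complete domain validator.

```python
import re

# Label pattern from the question: starts and ends with an alphanumeric,
# with up to 61 alphanumerics or hyphens in between.
LABEL = r'[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'

# The optional "www." is matched outside the capturing group,
# so it is consumed but not returned.
DOMAIN_RE = re.compile(
    r'\b(?:www\.)?((?:' + LABEL + r'\.)+[a-zA-Z]{2,6})\b'
)

s = 'this is www.website1.com and this is website2.com'
domains = DOMAIN_RE.findall(s)
print(domains)  # ['website1.com', 'website2.com']
```

Other subdomains are not stripped by this sketch; only a leading www. is treated specially.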

Next, I only look for .com domains. To match other types of domains, you could adjust my regular expression to something like:

\s(?:www\.)?(\w+\.(?:com|org|net))
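When extending the pattern to several endings with re.findall, it matters whether the alternation is a capturing group: with more than one capturing group, findall returns a tuple per match instead of a plain string. A small sketch that shows the difference (the test string with a .org domain is an assumption added for illustration):

```python
import re

s = 'this is www.website1.com and this is website2.org'

# Nested capturing group (com|org|net): one tuple per match.
nested = re.findall(r'\s(?:www\.)?(\w+\.(com|org|net))', s)
print(nested)  # [('website1.com', 'com'), ('website2.org', 'org')]

# Non-capturing alternation (?:com|org|net): plain strings.
flat = re.findall(r'\s(?:www\.)?(\w+\.(?:com|org|net))', s)
print(flat)  # ['website1.com', 'website2.org']
```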


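Here's an attempt:

import re
s = "www.website1.com"
# Group 1 is the optional "www." prefix, group 2 is everything after it.
# findall returns (group1, group2) tuples; take group 2 of the first match.
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)

Output:

website1.com

With s = "website1.com", the output is the same:

website1.com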
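If the input is always a single host name, as in this answer, a simpler alternative might be to drop the prefix with re.sub anchored at the start of the string. This is only a sketch, and strip_www is a hypothetical helper name, not something from the answer above:

```python
import re

def strip_www(host):
    """Remove a leading 'www.' from host, if present (hypothetical helper)."""
    # ^ anchors at the start, so only a leading "www." is removed.
    return re.sub(r'^www\.', '', host)

print(strip_www('www.website1.com'))  # website1.com
print(strip_www('website1.com'))      # website1.com
```

Unlike the findall patterns above, this does not search inside a longer sentence; it only cleans up a string that is already a host name.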
