Regex not matching URL in Python [duplicate]

Question

Possible Duplicate:
how to extract domain name from URL

I want to extract the host name from a URL, e.g. console.aws.amazon.com from the following URL.

>>> ts
'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> re.match(ts,'(")?http(s)?://(.*?)/').group(0)

Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
re.match(ts,'(")?http(s)?://(.*?)/').group(0)
AttributeError: 'NoneType' object has no attribute 'group'

I tried this regular expression in JS and it worked. Any idea why it matches in JS but not in Python?

Regex or regexp if you like, but not regrex. Short for Regular Expression.
– dschulz, Commented Jan 9, 2013 at 2:29
Voting to reopen, as this specific question asks for a regular expression to extract the domain. The comment below the answer clarifies why urlparse is not ideal in this case: an exe will be exported, and the fewer includes the better.
– Josh Smeaton, Commented Jan 10, 2013 at 1:04

Ruben · Accepted Answer · 2013-01-09 02:28:28Z

5

You are passing the arguments to re.match in the wrong order. The Python docs say:

re.match(pattern, string, flags=0)

You are doing:

re.match(string, pattern)

So simply change it to:

 re.match('(")?http(s)?://(.*?)/', ts).group(0)
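As a rough sketch of the corrected call: group(0) returns the whole matched prefix (scheme, host and trailing slash), while the host asked for in the question is capture group 3. The helper name extract_host and the shortened URL are only illustrative, not part of the original answer.

```python
import re

# Same pattern as in the question, compiled once. Group 3 is the lazy
# "(.*?)" between "://" and the next "/", i.e. the host.
URL_PATTERN = re.compile(r'(")?http(s)?://(.*?)/')


def extract_host(url):
    """Return the host part of url, or None if the pattern does not match."""
    match = URL_PATTERN.match(url)  # pattern first, then the string
    if match is None:
        return None
    return match.group(3)


ts = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances'
print(URL_PATTERN.match(ts).group(0))  # whole match, including scheme and "/"
print(extract_host(ts))                # host only
```

Checking for None before calling group avoids the AttributeError from the question. Note that this pattern requires a "/" after the host, so a URL such as https://example.com with no path does not match.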


6 Comments

Shawn Zhang Over a year ago

OK, that's the root cause. :)

Ruben Over a year ago

Glad you solved it ;) Although using existing tools, like the peeps are suggesting below, is definitely something you should look at. Don't write stuff yourself if it already exists ;) </lazy>

hd1 Over a year ago

Why are you encouraging it then if you're recommending "don't write stuff yourself if it already exists"?

Ruben Over a year ago

Because it is a solution to the problem. The other answers are alternatives (not solutions) for the problem Shawn is having.

hd1 Over a year ago

While it is a solution, @ShawnZhang should be using urlparse, which is intended for precisely this purpose, instead of going through some convoluted regexp developed by a random internet user.


Josh Smeaton · Accepted Answer · 2013-01-09 02:32:33Z

5

Use urlparse (in Python 3, import it from urllib.parse instead of urlparse):

>>> from urlparse import urlparse
>>> u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> p = urlparse(u)
>>> p
ParseResult(scheme='https', netloc='console.aws.amazon.com', path='/ec2/home', params='', query='region=us-east-1', fragment='s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')
>>> p.netloc
'console.aws.amazon.com'
>>>
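For Python 3, the same approach might look like the sketch below. The helper name host_of and the example URLs are hypothetical; urlparse, netloc and hostname are the standard-library names.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def host_of(url):
    """Return the network location (host, plus any port or userinfo) of url."""
    return urlparse(url).netloc


u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances'
print(host_of(u))

# netloc keeps userinfo and port; hostname strips both and lowercases the host.
p = urlparse('https://user@Example.com:8443/path')
print(p.netloc)
print(p.hostname)
```

If only the bare host is wanted, hostname is usually the better attribute, since netloc can carry a user name and port along with it.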


Volatility · Accepted Answer · 2013-01-09 02:25:59Z

0

You could always use the str.partition method for this:

>>> print(ts.partition('//')[2].partition('/')[0])
console.aws.amazon.com

Regular expressions are a bit of overkill for this.
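A small sketch of the partition approach wrapped in a function, mostly to show how it behaves on edge cases. The name host_by_partition is made up for illustration.

```python
def host_by_partition(url):
    """Return the text between the first '//' and the next '/'."""
    # partition returns (before, separator, after); if the separator is
    # missing, "after" is the empty string.
    after_scheme = url.partition('//')[2]
    return after_scheme.partition('/')[0]


print(host_by_partition('https://console.aws.amazon.com/ec2/home'))
print(host_by_partition('https://example.com'))         # no path: still works
print(host_by_partition('console.aws.amazon.com/ec2'))  # no '//': empty string
```

Unlike the regular expression in the question, this does not need a "/" after the host, but it returns an empty string when the URL has no "//" at all.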


1 Comment

hd1 Over a year ago

Even your solution is a bit overkill as the urlparse module exists for precisely this purpose.
