0

Possible Duplicate:
how to extract domain name from URL

I want to extract the host name from a URL, e.g. console.aws.amazon.com from the following URL.

>>> ts
'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> re.match(ts,'(")?http(s)?://(.*?)/').group(0)

Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
re.match(ts,'(")?http(s)?://(.*?)/').group(0)
AttributeError: 'NoneType' object has no attribute 'group'

I tried this regular expression in JS and it worked. Any idea why this matches in JS, but it doesn't work in Python?

2
  • Regex or regexp if you like, but not regrex. Short for Regular Expression. Commented Jan 9, 2013 at 2:29
  • Vote to reopen, as this specific question asks for a regular expression to extract the domain. The comment below the answer clarifies why urlparse is not ideal in this case: an exe will be exported, and the fewer imports the better. Commented Jan 10, 2013 at 1:04

3 Answers

5

You are calling re.match with the arguments in the wrong order. The Python docs say:

re.match(pattern, string, flags=0)

You are doing:

re.match(string, pattern)

So simply change it to:

 re.match('(")?http(s)?://(.*?)/', ts).group(0)
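One thing worth noting about the corrected call: group(0) is the whole matched text, scheme and trailing slash included, while the host itself sits in the third capture group. A minimal sketch, reusing the question's URL and pattern unchanged (the variable names `whole` and `host` are only for illustration):

```python
import re

ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

# Pattern first, string second.
m = re.match(r'(")?http(s)?://(.*?)/', ts)

# re.match returns None when nothing matches, so guard before calling .group().
whole = m.group(0) if m else None  # entire matched text
host = m.group(3) if m else None   # text between "://" and the next "/"

print(whole)  # https://console.aws.amazon.com/
print(host)   # console.aws.amazon.com
```

The None check is what avoids the AttributeError from the question when the pattern does not match.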

6 Comments

OK, that's the root cause. :)
Glad you solved it ;) Although using existing tools, as others suggest below, is definitely something you should look at. Don't write stuff yourself if it already exists ;) </lazy>
Why are you encouraging it then if you're recommending "don't write stuff yourself if it already exists"?
Because it is a solution to the problem. The other answers are alternatives (not solutions) to the problem Shawn is having.
While it is a solution, @ShawnZhang should be using urlparse, which is intended for precisely this purpose, instead of going through some convoluted regexp developed by a random internet user.
5

Use urlparse

>>> from urlparse import urlparse
>>> u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> p = urlparse(u)
>>> p
ParseResult(scheme='https', netloc='console.aws.amazon.com', path='/ec2/home', params='', query='region=us-east-1', fragment='s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')
>>> p.netloc
'console.aws.amazon.com'
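The session above is Python 2. In Python 3 the same function lives in urllib.parse, so a rough equivalent, assuming a Python 3 interpreter, would look like this:

```python
from urllib.parse import urlparse  # Python 3 home of the old urlparse module

u = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
     '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

p = urlparse(u)

print(p.scheme)  # https
print(p.netloc)  # console.aws.amazon.com
print(p.path)    # /ec2/home
```

If the URL may carry a port or user info, p.hostname returns just the lowercased host part, whereas p.netloc keeps everything between "//" and the path.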


0

You could always use the str.partition method for this:

>>> print(ts.partition('//')[2].partition('/')[0])
console.aws.amazon.com

Regular expressions are a bit overkill for this.
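To spell out what the one-liner does: str.partition splits on the first occurrence of the separator and always returns a 3-tuple (head, separator, tail); if the separator is missing, the last two items are empty strings. A small sketch wrapping that in a function (the name host_from_url is made up for this example):

```python
def host_from_url(url):
    """Return the text between '//' and the next '/', or '' if there is no '//'."""
    # partition('//') -> (scheme part, '//', everything after it)
    _, sep, rest = url.partition('//')
    if not sep:
        # No '//' in the input, so there is nothing to extract.
        return ''
    # partition('/') -> (host part, '/', path and the rest)
    return rest.partition('/')[0]


ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

print(host_from_url(ts))  # console.aws.amazon.com
```

This keeps any port or user info in the result and does not handle a query string that follows the host without a path, so it only suits URLs shaped like the one in the question.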

1 Comment

Even your solution is a bit overkill as the urlparse module exists for precisely this purpose.
