32

I can be given a string in any of these formats:

I would like to extract the host and if present a port. If the port value is not present I would like it to default to 80.

I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.

I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.

1
  • What regex have you tried? If not a regex, what code have you written? Commented Mar 2, 2012 at 10:06

5 Answers

57

You can use urlparse to get the hostname from a URL string:

from urlparse import urlparse
print urlparse("http://www.website.com/abc/xyz.html").hostname # prints www.website.com

1 Comment

In Python 3, use import urllib.parse and then urllib.parse.urlparse('http://....')
20
>>> from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456

4 Comments

I don't know why, but when you run it as aaa = urlparse('www.acme.com:456'), aaa.hostname is None. Do you know why? By the way, that's exactly what the question asks.
@RodrigoLaguna Real late to the party here, but this sits as an unresolved question. There's a difference between urlparse('www.acme.com:456') and urlparse('http://www.acme.com:456'). From the docs, urlparse assumes an RFC1808-compliant URL, and won't recognise the network location correctly unless it's introduced with a // - docs.python.org/2/library/urlparse.html#urlparse.urlparse.
Per @user1156544: In Python 3, use import urllib.parse and then urllib.parse.urlparse('http://....')
For Python 3: from urllib.parse import urlparse. Ref @Maksym Kozlenko
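To illustrate the point in the comments about the leading //, here is a small Python 3 sketch. It only checks netloc, hostname and port, because how the bare host:port string is split between scheme and path has varied between Python versions.

```python
from urllib.parse import urlparse

# Without a leading '//', no network location is recognised:
# netloc is '' and hostname is None.
bare = urlparse('www.acme.com:456')
print(repr(bare.netloc), bare.hostname)

# Prefixing '//' marks what follows as the network location.
prefixed = urlparse('//www.acme.com:456')
print(prefixed.hostname, prefixed.port)  # www.acme.com 456
```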
8

I'm not that familiar with urlparse, but using regex you'd do something like:

import re

p = r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'

m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'

Or, without port:

m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'

EDIT: fixed regex to also match 'www.abc.com 123'
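As a rough sketch, the pattern above could be wrapped in a helper that applies the default port of 80 the question asks for. The function name host_and_port and its default_port parameter are made up for this illustration.

```python
import re

# Pattern copied unchanged from the answer above.
PATTERN = re.compile(r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*')

def host_and_port(text, default_port=80):
    """Hypothetical helper: return (host, port), falling back to default_port."""
    m = PATTERN.search(text)
    if m is None:
        raise ValueError('no host found in %r' % text)
    port = m.group('port')
    # An empty port group means no port was given.
    return m.group('host'), int(port) if port else default_port

print(host_and_port('http://www.abc.com:123/test'))  # ('www.abc.com', 123)
print(host_and_port('www.abc.com 123'))              # ('www.abc.com', 123)
print(host_and_port('http://www.abc.com/test'))      # ('www.abc.com', 80)
```

One caveat: the `.?` before the port group matches any character, so a path that starts with digits, such as www.abc.com/123, would be read as port 123.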

5 Comments

I'm assuming the downvotes are because this solution is overly complicated. I accept that, and agree with @ntziolis that you should try to use standard functionality when possible.
Standard urlparse won't work for a string that does not start with http(s) or //, so this solution seems helpful. Why downvote without explaining?
This fails for URLs with literal IPv6 addresses like http://[2001:db8:85a3::8a2e:370:7334]:80/test.
Better than urlparse, which gets confused by ports even in 3.12.
Better to use the built-in library, because there are so many different types of URLs to handle.
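The IPv6 case raised in the comments can be checked directly. This sketch runs the regex from the answer and urlparse (Python 3) on the same URL.

```python
import re
from urllib.parse import urlparse

url = 'http://[2001:db8:85a3::8a2e:370:7334]:80/test'

# The regex stops the host at the first ':' inside the brackets.
pattern = r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(pattern, url)
print(m.group('host'), repr(m.group('port')))  # [2001 ''

# urlparse understands the bracketed literal and strips the brackets.
parsed = urlparse(url)
print(parsed.hostname, parsed.port)  # 2001:db8:85a3::8a2e:370:7334 80
```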
6

The reason it fails for:

www.acme.com 456

is because it is not a valid URI. Why don't you just:

  1. Replace the space with a :
  2. Parse the resulting string by using the standard urlparse method

Try to make use of default functionality as much as possible, especially when it comes to parsing well-known formats like URIs.
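A minimal sketch of the normalize-then-parse approach in Python 3 follows. It assumes one extra step that the list above does not mention: prefixing // when the string has no scheme, since urlparse only recognises a network location after //. The name parse_host_port and its default_port parameter are hypothetical.

```python
from urllib.parse import urlparse

def parse_host_port(text, default_port=80):
    """Hypothetical helper: normalize the input, then let urlparse do the work."""
    # Step 1: 'hostname port' becomes 'hostname:port'.
    text = text.strip().replace(' ', ':')
    # Assumed extra step: without a scheme or '//', urlparse finds no netloc.
    if '://' not in text and not text.startswith('//'):
        text = '//' + text
    # Step 2: parse with the standard library.
    parsed = urlparse(text)
    port = parsed.port if parsed.port is not None else default_port
    return parsed.hostname, port

print(parse_host_port('http://www.acme.com:456/abc'))  # ('www.acme.com', 456)
print(parse_host_port('www.acme.com:456'))             # ('www.acme.com', 456)
print(parse_host_port('www.acme.com 456'))             # ('www.acme.com', 456)
print(parse_host_port('www.acme.com'))                 # ('www.acme.com', 80)
```

The default of 80 is applied whatever the scheme, as the question requests. A scheme-aware default (for example 443 for https) would need an extra lookup.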

5 Comments

When I use urlparse on host:port it puts the hostname in the scheme rather than netloc.
From the manual: "Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component."
I'm not saying it's wrong, but it doesn't seem the best way for processing the hostname:port format. And adding prefixes doesn't seem very elegant.
Basically it boils down to this: 1. Do you normalize before parsing (using a standard function) or 2. do you try and use regex or something like it to handle the different formats while parsing. In my experience it's better to normalize since the regex solutions are easy to get wrong + you are replicating existing functionality.
At the moment, I'm thinking I'll use urlparse on the URL and the regex by @claesv on the hostname:port format.
5

Method using urllib -

    from urllib.parse import urlparse
    url = 'https://stackoverflow.com/questions'
    print(urlparse(url))

Output -

ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')

Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
