Parsing hostname and port from string or url

Question

I can be given a string in any of these formats:

url: e.g. http://www.acme.com:456
string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com

I would like to extract the host and, if present, the port. If the port is not present, I would like it to default to 80.

I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.

I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.

What regex have you tried? If not a regex, what code have you written?
– dejjub-AIS, Commented Mar 2, 2012 at 10:06

Maksym Kozlenko · Accepted Answer · 2013-07-21 07:17:21Z

57

You can use urlparse to get the hostname from a URL string:

from urlparse import urlparse
print urlparse("http://www.website.com/abc/xyz.html").hostname # prints www.website.com

answered Jul 21, 2013 at 7:17

Maksym Kozlenko


1 Comment

user1156544 Over a year ago

In Python3 use: import urllib and urllib.parse.urlparse('http://....')
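
To keep the answer's snippet working on both Python 2 and Python 3, one option is a guarded import. This is only a minimal sketch; the fallback order is a common convention, not something the answer itself prescribes.

```python
# Try the Python 3 location first, then fall back to the Python 2 module.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse

hostname = urlparse("http://www.website.com/abc/xyz.html").hostname
print(hostname)
```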

ghickman · Accepted Answer · 2015-10-28 16:28:55Z

20

>>> from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456

edited Oct 28, 2015 at 16:28

ghickman


answered Jan 7, 2015 at 21:26

dfostic


4 Comments

Rodrigo Laguna Over a year ago

I don't know why, but when you run it as aaa = urlparse('www.acme.com:456'), aaa.hostname is None. Do you know why? By the way, that's exactly what the question asks about.

ymbirtt Over a year ago

@RodrigoLaguna Real late to the party here, but this sits as an unresolved question. There's a difference between urlparse('www.acme.com:456') and urlparse('http://www.acme.com:456'). From the docs, urlparse assumes an RFC1808-compliant URL, and won't recognise the network location correctly unless it's introduced with a // - docs.python.org/2/library/urlparse.html#urlparse.urlparse.
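
The difference described in this comment can be checked directly. The sketch below uses the Python 3 import; the variable names `bare` and `prefixed` are illustrative only. How the unprefixed string is divided between scheme and path has varied between Python versions, so the example only inspects the network location.

```python
from urllib.parse import urlparse

bare = urlparse('www.acme.com:456')        # no '//', so no netloc is recognized
prefixed = urlparse('//www.acme.com:456')  # '//' introduces the netloc

print(repr(bare.netloc), bare.hostname)
print(prefixed.hostname, prefixed.port)
```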

VoteCoffee Over a year ago

Per @user1156544: In Python3 use: import urllib and urllib.parse.urlparse('http://....')

kta Over a year ago

For python3, from urllib.parse import urlparse. Ref @Maksym Kozlenko

claesv · Accepted Answer · 2013-05-28 06:07:52Z

8

I'm not that familiar with urlparse, but using regex you'd do something like:

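import re

# Optional scheme, then the host, one optional separator character, then the port digits.
p = r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'

m = re.search(p, 'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'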

Or, without port:

m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'

EDIT: fixed regex to also match 'www.abc.com 123'
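
As a rough sketch of how the empty port could be turned into the default of 80, the pattern from this answer can be wrapped in a small helper. The name `host_and_port` and its `default_port` parameter are hypothetical, introduced here only for illustration.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*')

def host_and_port(text, default_port=80):
    """Return (host, port), using default_port when no digits were captured."""
    m = PATTERN.search(text)
    if m is None:
        raise ValueError('no host found in %r' % (text,))
    return m.group('host'), int(m.group('port') or default_port)

for s in ('http://www.acme.com:456', 'www.acme.com:456',
          'www.acme.com 456', 'www.acme.com'):
    print(s, '->', host_and_port(s))
```

One caveat with this pattern: because any single character may separate the host from the digits, a path that starts with digits (for example 'http://www.abc.com/123') is read as a port.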

edited May 28, 2013 at 6:07

answered Mar 2, 2012 at 9:54

claesv


5 Comments

claesv Over a year ago

I'm assuming the downvotes are because this solution is overly complicated. I accept that, and agree with @ntziolis that you should try to use standard functionality when possible.

Nhu Trinh Over a year ago

The standard urlparse won't work for a string that does not start with http(s) or //, so this solution seems helpful. Why downvote without explaining?

Anders Kaseorg Over a year ago

This fails for URLs with literal IPv6 addresses like http://[2001:db8:85a3::8a2e:370:7334]:80/test.
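
For comparison, the standard library handles the bracketed IPv6 form from this comment. A minimal check, using the Python 3 import:

```python
from urllib.parse import urlsplit

parts = urlsplit('http://[2001:db8:85a3::8a2e:370:7334]:80/test')
print(parts.hostname)  # the brackets are stripped from the IPv6 literal
print(parts.port)
```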

sivann Over a year ago

better than urlparse which gets confused with ports even in 3.12

kta Over a year ago

Better to use the built-in library, because there are so many different types of URLs to handle.

ntziolis · Accepted Answer · 2012-03-02 10:03:16Z

6

The reason it fails for:

www.acme.com 456

is that it is not a valid URI. Why don't you just:

1. Replace the space with a :
2. Parse the resulting string using the standard urlparse method

Try to make use of default functionality as much as possible, especially when it comes to parsing well-known formats like URIs.
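
The normalize-then-parse idea could look roughly like the sketch below. It goes one step beyond what this answer states: it also prepends '//' when no scheme is present, because urlparse and urlsplit only recognize a network location after '//'. The helper name `parse_host_port` and its `default_port` parameter are hypothetical.

```python
from urllib.parse import urlsplit

def parse_host_port(text, default_port=80):
    """Return (host, port) for 'http://host:port', 'host:port', 'host port' or 'host'."""
    # Step 1: normalize "host port" into "host:port".
    normalized = text.strip().replace(' ', ':')
    # Assumption beyond the answer: add '//' so the netloc is recognized
    # when the input has no scheme.
    if '://' not in normalized:
        normalized = '//' + normalized
    # Step 2: let the standard library do the parsing.
    parts = urlsplit(normalized)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port

for s in ('http://www.acme.com:456', 'www.acme.com:456',
          'www.acme.com 456', 'www.acme.com'):
    print(s, '->', parse_host_port(s))
```

Normalizing first keeps all of the actual URL handling in the standard library, which is the trade-off this answer argues for.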

edited Mar 2, 2012 at 10:03

answered Mar 2, 2012 at 9:56

ntziolis


5 Comments

TonyM Over a year ago

When I use urlparse on host:port it puts the hostname in the scheme rather than netloc.

ntziolis Over a year ago

From the manual: "Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component."

TonyM Over a year ago

I'm not saying it's wrong, but it doesn't seem the best way for processing the hostname:port format. And adding prefixes doesn't seem very elegant.

ntziolis Over a year ago

Basically it boils down to this: 1. Do you normalize before parsing (using a standard function) or 2. do you try and use regex or something like it to handle the different formats while parsing. In my experience it's better to normalize since the regex solutions are easy to get wrong + you are replicating existing functionality.

TonyM Over a year ago

At the moment, I'm thinking I'll use urlparse on the URL and the regex by @claesv on the hostname:port format.

Ishaan · Accepted Answer · 2020-01-09 07:22:18Z

5

Method using urllib -

    from urllib.parse import urlparse
    url = 'https://stackoverflow.com/questions'
    print(urlparse(url))

Output -

ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')
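
The host and port asked about in the question are available as attributes on that result. A short sketch, using the same URL; since this URL carries no explicit port, the port attribute is None and a default would have to be applied by the caller.

```python
from urllib.parse import urlparse

result = urlparse('https://stackoverflow.com/questions')
print(result.hostname)  # host part of the netloc
print(result.port)      # None, because the URL has no explicit port
```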

Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python

answered Jan 9, 2020 at 7:22

Ishaan
