How to convert a UTF-8 string to URL compliant string in Python?

Question

I'm sure I'm not the first one to run into this problem. But after hours of debugging, Googling and StackOverflow-ing without finding an answer, I decided to post this question. So sorry in advance if I missed anything, but by now, I'm pretty confused.

I'm using BeautifulSoup to parse a UTF-8 website. I'm using text from the website to build a URL to further crawl to. I'm running into some problems with non-English characters.

For example: the site contains the string Originální formule and I want to use it to build the URL: http://blahblah.com/Originální-formule or http://blahblah.com/origin%C3%A1ln%C3%AD-formule. The problem is, I'm getting http://blahblah.com/Origin\xe1ln\xed-formule, which produces an error. I tried to encode, decode and what-not, yet I still can't get the proper URL.

BTW, when I print u'Origin\xe1ln\xed-formule', the string prints just fine. It's just the encoding that doesn't succeed.

What am I doing wrong?

The question is, how to convert the string u'Origin\xe1ln\xed-formule' to something I can use with urllib2/urllib.urlopen()?
– Ofirov, Commented Aug 8, 2012 at 11:22
@l4mpi urlencode doesn't accept a string as a parameter. It's used to encode parameters into a parameters string. This is not my case. SanSS's answer is correct, though.
– Ofirov, Commented Aug 8, 2012 at 12:24

Santiago Alessandri · Accepted Answer · 2012-08-08 11:42:44Z


In order to achieve what you are expecting you have to do the following things:

Decompose the url
Get the path part and encode it to utf-8
Quote the path
Join each part to get back a quoted URL

You can perform these with a combination of the following functions:

urlparse.urlparse (docs)
urllib.quote (docs)
urlparse.urlunparse (docs)

The code will end up like this:

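# -*- coding: utf-8 -*-
# Python 2 (urlparse and urllib.quote moved to urllib.parse in Python 3)
from urlparse import urlparse, urlunparse
from urllib import quote

x = u'http://blahblah.com/Originální-formule'
# Encode to UTF-8 bytes, then split the URL into its six components
parsed_url = list(urlparse(x.encode('utf-8')))
# Index 2 is the path; percent-encode it
parsed_url[2] = quote(parsed_url[2])
print urlunparse(parsed_url)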

Result: http://blahblah.com/Origin%C3%A1ln%C3%AD-formule
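
For readers on Python 3, where urlparse and urllib.quote live in urllib.parse, the same decompose/quote/join steps might look like the sketch below. The helper name quote_url_path is made up for illustration; it assumes the path is not already percent-encoded, since quote() would turn an existing "%" into "%25".

```python
from urllib.parse import urlparse, urlunparse, quote


def quote_url_path(url):
    """Percent-encode only the path component of a URL (hypothetical helper)."""
    parts = urlparse(url)
    # quote() encodes str input as UTF-8 by default and leaves "/" unescaped.
    quoted_path = quote(parts.path)
    # ParseResult is a namedtuple, so _replace swaps in the quoted path.
    return urlunparse(parts._replace(path=quoted_path))


print(quote_url_path('http://blahblah.com/Originální-formule'))
# http://blahblah.com/Origin%C3%A1ln%C3%AD-formule
```

No explicit .encode('utf-8') call is needed here because Python 3 strings are Unicode and quote() handles the encoding itself.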

edited Aug 8, 2012 at 11:42

answered Aug 8, 2012 at 11:37

Santiago Alessandri


2 Comments

Ofirov Over a year ago

I didn't need to split the url and join it again. But what I needed to do was indeed to encode it to utf-8 and then to quote it. Thanks!
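
The shortcut described in this comment (quote only the scraped text, then append it to a base URL that is already ASCII) could be sketched in Python 3 as follows. BASE_URL, build_url, and the space-to-hyphen replacement are assumptions inferred from the example in the question, not code the asker posted.

```python
from urllib.parse import quote

# Assumed base URL, taken from the example in the question.
BASE_URL = 'http://blahblah.com/'


def build_url(text):
    """Turn scraped page text into a percent-encoded URL (hypothetical helper)."""
    # Assumption: spaces become hyphens, as in "Originální-formule".
    slug = text.replace(' ', '-')
    # safe='' also escapes "/", so the text always stays a single path segment.
    return BASE_URL + quote(slug, safe='')


print(build_url('Originální formule'))
# http://blahblah.com/Origin%C3%A1ln%C3%AD-formule
```

Quoting only the segment matters: passing the whole URL to quote() would also escape the colon after the scheme, producing http%3A//blahblah.com/...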

bobince Over a year ago

You need to split it if you have non-ASCII characters in the hostname, as they need to be encoded using the Punycode algorithm (IDNA) rather than UTF-8+%-encode.
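
To illustrate this point, here is a rough Python 3 sketch that encodes the hostname with Python's built-in "idna" codec (which implements the older IDNA 2003 rules) and percent-encodes the path separately. The function name and the example domain are made up; the sketch assumes an absolute URL with a hostname, drops any user:password@ part, and leaves the query and fragment untouched.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def encode_host_and_path(url):
    """IDNA-encode the hostname and percent-encode the path (hypothetical helper)."""
    parts = urlsplit(url)
    # The hostname goes through Punycode/IDNA, not UTF-8 percent-encoding.
    host = parts.hostname.encode('idna').decode('ascii')
    if parts.port is not None:
        host = '{}:{}'.format(host, parts.port)
    # The path is UTF-8 encoded and percent-escaped; "/" is left as-is.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(encode_host_and_path('http://bücher.example/Originální-formule'))
# http://xn--bcher-kva.example/Origin%C3%A1ln%C3%AD-formule
```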
