I'm sure I'm not the first one to run into this problem. But after hours of debugging, Googling and StackOverflow-ing without finding an answer, I decided to post this question. So sorry in advance if I missed anything, but by now, I'm pretty confused.

I'm using BeautifulSoup to parse a UTF-8 website, and I use text from the page to build the next URL to crawl. I'm running into problems with non-English characters.

For example: the site contains the string Originální formule and I want to use it to build the URL: http://blahblah.com/Originální-formule or http://blahblah.com/origin%C3%A1ln%C3%AD-formule. The problem is, I'm getting http://blahblah.com/Origin\xe1ln\xed-formule, which produces an error. I tried to encode, decode and what-not, yet I still can't get the proper URL.

BTW, when I print u'Origin\xe1ln\xed-formule', the string prints just fine. It's just the encoding that doesn't succeed.

What am I doing wrong?

  • ... We don't know. What are you doing? Commented Aug 8, 2012 at 10:49
  • The question is, how to convert the string u'Origin\xe1ln\xed-formule' to something I can use with urllib2/urllib.urlopen()? Commented Aug 8, 2012 at 11:22
  • have you tried the urlencode function? Commented Aug 8, 2012 at 11:29
  • @l4mpi urlencode doesn't accept a string as a parameter. It's used to encode parameters into a parameters string. This is not my case. SanSS's answer is correct, though. Commented Aug 8, 2012 at 12:24

1 Answer

In order to achieve what you are expecting you have to do the following things:

  1. Decompose the url
  2. Get the path part and encode it to utf-8
  3. Quote the path
  4. Join each part to get back a quoted URL

You can perform these with a combination of the following functions:

  • urlparse.urlparse (docs)
  • urllib.quote (docs)
  • urlparse.urlunparse (docs)

The code will end up like this:

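# -*- coding: utf-8 -*-
from urlparse import urlparse, urlunparse
from urllib import quote

x = u'http://blahblah.com/Originální-formule'
# Encode to UTF-8 bytes, then split into [scheme, netloc, path, params, query, fragment]
parsed_url = list(urlparse(x.encode('utf-8')))
# Percent-encode only the path component (index 2)
parsed_url[2] = quote(parsed_url[2])
urlunparse(parsed_url)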

Result: http://blahblah.com/Origin%C3%A1ln%C3%AD-formule
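The code above is Python 2. For readers on Python 3, the same four steps can be sketched with urllib.parse, which replaced the urlparse and urllib.quote modules. This is a minimal sketch, not part of the original answer: the helper name quote_url_path is hypothetical, and it assumes the path is not already percent-encoded (an existing '%' would be escaped again).

```python
from urllib.parse import urlsplit, urlunsplit, quote


def quote_url_path(url):
    """Percent-encode the path of `url` as UTF-8, leaving the other parts untouched.

    Hypothetical helper for illustration; assumes the path is not already quoted.
    """
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default and leaves '/' unescaped.
    quoted_path = quote(parts.path)
    return urlunsplit(parts._replace(path=quoted_path))


print(quote_url_path('http://blahblah.com/Originální-formule'))
```

In Python 3 the explicit .encode('utf-8') step is not needed, because quote() accepts a str and encodes it as UTF-8 unless told otherwise.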

2 Comments

I didn't need to split the url and join it again. But what I needed to do was indeed to encode it to utf-8 and then to quote it. Thanks!
You need to split it if you have non-ASCII characters in the hostname, as they need to be encoded using the Punycode algorithm (IDNA) rather than UTF-8+%-encode.
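The hostname case from the comment above can be sketched roughly as follows, in Python 3. This is an illustration under stated assumptions, not a complete solution: the helper name to_ascii_url and the host bücher.example are hypothetical, the URL is assumed to be absolute, any user:password part is dropped, and Python's built-in 'idna' codec implements the older IDNA 2003 rules.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def to_ascii_url(url):
    """Return `url` with an IDNA-encoded host and a UTF-8 percent-encoded path.

    Hypothetical helper: assumes an absolute URL and ignores userinfo.
    """
    parts = urlsplit(url)
    # The hostname goes through Punycode (IDNA), not percent-encoding.
    host = parts.hostname.encode('idna').decode('ascii')
    netloc = host if parts.port is None else '%s:%d' % (host, parts.port)
    # The path is UTF-8 encoded and then percent-encoded.
    path = quote(parts.path)
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(to_ascii_url('http://bücher.example/Originální-formule'))
```

Splitting first matters here because the two components use different encodings; quoting the whole URL in one pass would percent-encode the hostname as well.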