2

in the process of parsing xml from a website, I've managed to get lost in a bunch of utf-8 encoding issues. Specifically, I have strings that look like:

u'PA_g\xc3\xa9p7'

When I print this I get:

>> PA_gép7

What I want instead comes from the following

print('PA_g\xc3\xa9p7')
>> PA_gép7

Here is my code:

def get_api_xml_response(base_url, query_str):
"""gets xml from api @ base_url using query_str"""
  res = requests.get(u'{}{}'.format(base_url, query_str))
  xmlstring = clean_up_xml(res.content).encode(u'utf-8')
  return ET.XML(xmlstring)

My function clean_up_xml exists to remove the namespace and other chars that were causing me problems.

def clean_up_xml(xml_string):
"""remove the namespace and invalid chars from an xml-string"""
   return re.sub(' xmlns="[^"]+"', '', xml_string, count=1).replace('&', '&')

1 Answer 1

3

You take from res.content a binary string encoded in /most probably/ UTF-8 and encode it into UTF-8 once again. Binary strings should only be decode()'d, Unicode strings should only be encode()'d, except some special cases.

Since clean_up_xml() works with binary strings, it would be better to just pass binary input into ElementTree, it will handle correctly:

xmlstring = clean_up_xml(res.content)
# let ElementTree decode content using information from the XML itself
# e.g. <?xml version="1.0" encoding="UTF-8"?>
return ET.XML(xmlstring)

If you decide to refactor code to work with unicode then all binary inputs should be decoded as soon as possible:

# let requests decode response using information from HTTP header
# e.g. Content-Type: text/xml; charset=utf-16
xmlstring = clean_up_xml(res.text)
return ET.XML(xmlstring)

When asking question related to Unicode it is important to specify Python version, in this case Python 2 with print_function imported from the future. In Python 3 you would see the following:

>>> print('PA_g\xc3\xa9p7')
PA_gép7
>>> 'PA_g\xc3\xa9p7' == u'PA_g\xc3\xa9p7'
True
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks so much for your answer! You were right, I was encoding where I shouldn't have been!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.