lxml cannot parse XML (whether the encoding is utf-8 or not) [python]

Question

My code:

import re
import requests
from lxml import etree

url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'

r = requests.get(url)

items = r.json()['items']

without encode('utf-8'):

etree.fromstring(items[0]) output:

ValueError                                
Traceback (most recent call last)
<ipython-input-69-cb8697498318> in <module>()
----> 1 etree.fromstring(items[0])

lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()

parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

with encode('utf-8'):

etree.fromstring(items[0].encode('utf-8')) output:

  File "<string>", line unknown
XMLSyntaxError: CData section not finished
鎶楀啺鎶㈤櫓鎹锋姤:闃冲寳I绾挎, line 1, column 281

I have no idea how to parse this XML.

Look at the following answer: stackoverflow.com/questions/15830421/…
– Lea, Commented Dec 4, 2015 at 9:46

falsetru · Accepted Answer · 2015-12-04 09:58:49Z

5

As a workaround, you can remove the encoding attribute from the XML declaration before passing the string to etree.fromstring:

xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
root = etree.fromstring(xml)
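
A self-contained sketch of this workaround, for trying it without the network request. The sample document below is a made-up stand-in for items[0], not the real response; only the regex and the etree.fromstring call come from the answer.

```python
import re
from lxml import etree

# Hypothetical stand-in for items[0]: a str that still carries an
# XML declaration naming an encoding.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<item><title><![CDATA[你好]]></title></item>')

# Drop only the first encoding="..." attribute, i.e. the one in the declaration.
cleaned = re.sub(r'\bencoding="[-\w]+"', '', sample, count=1)

# With no encoding declaration left, lxml accepts the str directly.
root = etree.fromstring(cleaned)
title = root.findtext('title')
print(title)
```

Because count=1 is used, an encoding="..." attribute that appears later in the document body is left alone.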

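UPDATE after seeing @Lea's comment on the question:

Alternatively, pass bytes and specify a parser with an explicit encoding, which overrides the encoding declared in the document:

# items is a list of XML strings, so encode a single document to bytes
xml = items[0].encode('utf-8')
root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))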
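
A minimal sketch of the explicit-parser approach on a made-up document (the sample string is an assumption, standing in for one entry of items). It assumes the situation in the question: the declaration says gbk, but the bytes handed to lxml are utf-8.

```python
from lxml import etree

# Hypothetical stand-in for one entry of items: declared as gbk,
# but held in Python as an already-decoded str.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<item><title><![CDATA[你好]]></title></item>')

# Encode to utf-8 bytes, so the bytes no longer match the declared gbk.
data = sample.encode('utf-8')

# The parser's encoding argument overrides the encoding in the declaration.
parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(data, parser=parser)
title = root.findtext('title')
print(title)
```

Unlike the regex workaround, this leaves the document text untouched and only tells the parser how to read the bytes.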

edited Dec 4, 2015 at 9:58

answered Dec 4, 2015 at 9:32


5 Comments

Mithril Over a year ago

Could you explain why etree fails? If I get an XMLSyntaxError, will removing the encoding always work?

falsetru Over a year ago

@Mithril, I guess the mismatch between the declared gbk encoding and the actual utf-8 bytes causes the parser to interpret tags as non-tag content.
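
The kind of mismatch described here can be reproduced with the standard library alone. This is only an illustration of utf-8 bytes being read as gbk, not a trace of what lxml does internally; the sample text is made up.

```python
# Made-up sample text; any non-ASCII characters show the effect.
original = '你好世界'

# The bytes are really utf-8 ...
utf8_bytes = original.encode('utf-8')

# ... but are decoded as gbk, as a parser trusting encoding="gbk" would do.
# errors='replace' keeps the demo from raising on invalid gbk sequences.
garbled = utf8_bytes.decode('gbk', errors='replace')

print(garbled)                      # mojibake, not the original text
print(utf8_bytes.decode('utf-8'))   # the correct round trip
```

Since byte boundaries shift under the wrong codec, markup bytes next to the text (such as the end of a CDATA section) may be consumed as part of a character, which would be consistent with the "CData section not finished" error in the question.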

Mithril Over a year ago

It seems there is no all-purpose way to solve every error of this kind. Thank you very much.

Marcel Wilson Over a year ago

I wanted to see whether, instead of removing the encoding from the XML, lxml could use the gbk encoding. I got the following: blah = etree.fromstring(items[0].encode('gbk')) UnicodeEncodeError: 'gbk' codec can't encode character u'\ue468' in position 82: illegal multibyte sequence

Mithril Over a year ago

@Marcel Wilson I think r.json() decodes the response as utf-8, so encode('gbk') does not work.
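
The UnicodeEncodeError quoted above can be reproduced in isolation. A small sketch, assuming only the character reported in the traceback (U+E468, a private-use code point); the rest of the response text is not needed.

```python
# The character named in the traceback above.
text = '\ue468'

error = None
try:
    text.encode('gbk')
except UnicodeEncodeError as exc:
    # Python's gbk codec has no mapping for this code point.
    error = exc

print(error)
```

So the failure happens while encoding the str to gbk bytes, before lxml ever sees the data.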
