My code:

import re
import requests
from lxml import etree

url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'

r = requests.get(url)

items = r.json()['items']
  1. without encode('utf-8'):

etree.fromstring(items[0]) output:

ValueError                                
Traceback (most recent call last)
<ipython-input-69-cb8697498318> in <module>()
----> 1 etree.fromstring(items[0])

lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()

parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
  2. with encode('utf-8'):

etree.fromstring(items[0].encode('utf-8')) output:

  File "<string>", line unknown
XMLSyntaxError: CData section not finished
鎶楀啺鎶㈤櫓鎹锋姤:闃冲寳I绾挎, line 1, column 281

I have no idea how to parse this XML.

1 Answer

As a workaround, you can remove the encoding attribute from the XML declaration before passing the string to etree.fromstring:

xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
root = etree.fromstring(xml)
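
The substitution step on its own can be sketched as below. This is only an illustration: `sample` is a made-up stand-in for items[0] (the real feed content is not shown), and `strip_encoding_declaration` is a hypothetical helper name, not part of lxml.

```python
import re

# Hypothetical stand-in for items[0]: a unicode string that still carries
# an encoding declaration, which lxml rejects for unicode input.
sample = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><title><![CDATA[示例]]></title></DOCUMENT>'


def strip_encoding_declaration(xml_text):
    """Remove the first double-quoted encoding="..." from an XML string."""
    return re.sub(r'\bencoding="[-\w]+"', '', xml_text, count=1)


cleaned = strip_encoding_declaration(sample)
# cleaned no longer declares an encoding, so it can be handed to
# etree.fromstring as a unicode string.
```

Note that the pattern only matches a double-quoted value; a declaration written with single quotes would need a slightly broader regex.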

UPDATE after seeing @Lea's comment in the question:

Specify a parser with an explicit encoding:

xml = items[0].encode('utf-8')  # items is a list, so encode one element, not the list
root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))
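
A self-contained sketch of the same idea, without the network request. The `sample` document here is invented for illustration: it declares gbk while the bytes are actually UTF-8, which is assumed to mirror the situation in the question.

```python
from lxml import etree

# Hypothetical sample: the declaration says gbk, but we encode as UTF-8,
# so the declared and the actual encodings disagree.
sample = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><title><![CDATA[示例标题]]></title></DOCUMENT>'
data = sample.encode('utf-8')

# The parser-level encoding takes precedence over the one in the declaration.
parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(data, parser=parser)
title = root.findtext('title')
```

With this approach the declaration can stay in the document, because the parser is told which encoding the bytes really use.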

5 Comments

Could you explain why etree fails? If I get an XMLSyntaxError, will removing the encoding always work?
@Mithril, I guess that the mismatch between gbk and utf-8 causes the parser to interpret a tag as a non-tag.
It seems there is no universal way to solve all errors of this kind. Thank you very much.
I wanted to see whether lxml could use the gbk encoding instead of my removing the encoding from the XML. I got the following: blah = etree.fromstring(items[0].encode('gbk')) UnicodeEncodeError: 'gbk' codec can't encode character u'\ue468' in position 82: illegal multibyte sequence
@Marcel Wilson I think r.json() decodes the response as utf-8, so encode('gbk') does not work.
