7

I have an XML file; please download it and save it as blog.xml. It is the list of my files in Google Blogger. I wrote some code to parse it, but something goes wrong with lxml.

code1:

from stripogram import html2text
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8")
    print   html2text(string)

code1 produces the correct result.

code2:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'] 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

code2 fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
ValueError: Unicode strings with encoding declaration are not supported.

code3:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

code3 also fails, but with a different error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid

How should I handle the encoding so that lxml parses the HTML string properly?

2 Answers

5
+50

There is a bug in lxml. Check the output of this code:

import lxml.html
import feedparser

def test():
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e

d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')

test() # XMLSyntaxError: None

lxml.html.document_fromstring(e)
test() # XMLSyntaxError: line 1407: Tag b:include invalid

So the error message is misleading: parsing an empty string reports whatever syntax error was left over from the previous parse. The real reason your parsing fails is that you pass empty strings to document_fromstring.

Try this code:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    if not string:
        continue
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()
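
The two fixes in this thread (pass bytes rather than unicode, and skip empty entries) could be folded into one small helper. This is only a sketch under some assumptions: it is written for Python 3, whereas the code above is Python 2; the name to_parser_input is made up for illustration; and treating whitespace-only content as empty is an extra guess, not something the original code does.

```python
def to_parser_input(value):
    """Return UTF-8 bytes to hand to an lxml parser, or None to skip the entry.

    Hypothetical helper, not part of lxml or feedparser.
    """
    # lxml rejects unicode strings that carry an encoding declaration
    # (the ValueError in code2), so hand it bytes instead.
    if isinstance(value, str):
        value = value.encode("utf-8")
    # Empty content is what triggers the misleading XMLSyntaxError,
    # so signal the caller to skip it. Whitespace-only is an assumption.
    if not value.strip():
        return None
    return value
```

In the loop, an entry would then be skipped whenever to_parser_input(entry.content[0]['value']) returns None, and the returned bytes passed to lxml.html.document_fromstring otherwise.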

5 Comments

I suspect that there are parse errors in the entries but that the exception is being ignored by lxml at the wrong point. Python C-API exception handling requires code to check for exceptions at certain points, and if that is not done, the exception crops up later when another exception occurs that is handled properly. What happens if you omit the first test call? Does the same XMLSyntaxError occur?
This should certainly be reported to the LXML project in any case.
@Martijn Pieters: yes, the same error occurs, the first test call was only to show that XMLSyntaxError message changes after parsing e.
Thinking about it again, the error still reflects a previous error that has not been handled; this should certainly be reported to the devs.
I've found the bug in their bug tracker.
4

You could create a parser yourself, instead of using document_fromstring:

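from cStringIO import StringIO
import feedparser
from lxml import etree

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Encode to bytes so lxml does not reject an encoding declaration.
    text = entry.content[0]['value'].encode('utf8')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    # Join every text node to get the plain-text content.
    print ''.join(tree.xpath('.//text()'))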

For Blogger.com Atom feed exports, this works to print the text content of the .content[0].value entry.
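
For readers on Python 3, roughly the same approach might look like the sketch below. It assumes lxml is installed; cStringIO does not exist in Python 3, so io.BytesIO stands in for it, and the function name html_to_text is invented here for illustration.

```python
from io import BytesIO

from lxml import etree


def html_to_text(html_bytes):
    """Parse HTML bytes with an explicit HTMLParser and join all text nodes.

    Hypothetical helper mirroring the loop body above.
    """
    parser = etree.HTMLParser()
    tree = etree.parse(BytesIO(html_bytes), parser)
    # './/text()' selects every text node below the root element.
    return "".join(tree.xpath(".//text()"))
```

Each entry's content would be encoded to UTF-8 bytes first and then passed to html_to_text; empty entries probably still need to be skipped beforehand.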

3 Comments

1. Add from lxml import etree. 2. Maybe it should be print tree.text_content(). 3. But that gives the wrong output: Traceback (most recent call last): File "<stdin>", line 5, in <module> AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'
Traceback (most recent call last): File "<stdin>", line 5, in <module> AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'. There is still a problem.
@it_is_a_literature: my apologies, indeed that method does not exist.
