How to handle the encode in lxml to parse html-string properly?

Asked 12 years, 7 months ago

Modified 12 years, 7 months ago

Viewed 6k times

7

I have an XML file; please download it and save it as blog.xml. It is an export of my posts from Google Blogger. I wrote some code to parse it, but something goes wrong with lxml.

code1:

from stripogram import html2text
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8")
    print   html2text(string)

code1 produces the correct result.

code2:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'] 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

code2 fails with the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
 ValueError: Unicode strings with encoding declaration are not supported.

code3:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

code3 also fails, with a different traceback:

 Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
 lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid

How should I handle the encoding so that lxml parses the HTML string properly?

asked Apr 7, 2013 at 11:45

showkey

2 Answers

5

+50

There is a bug in lxml. Check the output of this code:

import lxml.html
import feedparser

def test():
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e

d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')

test() # XMLSyntaxError: None

lxml.html.document_fromstring(e)
test() # XMLSyntaxError: line 1407: Tag b:include invalid

So the error message is misleading: it is left over from the earlier parse of e. The real reason your parsing fails is that you pass empty strings to document_fromstring.

Try this code:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    if not string:
        continue
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()
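
For readers on Python 3, the same guard can be sketched as a small helper. This is only a sketch under stated assumptions, not the code from the answer above: the name extract_texts is hypothetical, and each input value is assumed to be the string found in entry.content[0]['value']. Empty or whitespace-only values are skipped before lxml sees them, str input is encoded to UTF-8 bytes as in code3, and the parser is told explicitly that the bytes are UTF-8.

```python
import lxml.html


def extract_texts(html_values):
    """Yield the text content of each non-empty HTML string.

    `extract_texts` is a hypothetical helper name, not part of lxml.
    """
    # Assumption: the content is UTF-8 once encoded, so say so explicitly
    # instead of letting the HTML parser guess the encoding.
    parser = lxml.html.HTMLParser(encoding="utf-8")
    for value in html_values:
        # Skip empty or whitespace-only content before it reaches lxml.
        if not value or not value.strip():
            continue
        # Pass bytes rather than str, as code3 in the question does.
        data = value.encode("utf-8") if isinstance(value, str) else value
        yield lxml.html.document_fromstring(data, parser=parser).text_content()
```

With feedparser, the input could be a generator such as (entry.content[0]['value'] for entry in d.entries), with each yielded text printed in turn.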

edited Apr 16, 2013 at 10:34

answered Apr 9, 2013 at 12:39

gatto

5 Comments

Martijn Pieters Over a year ago

I suspect that there are parse errors in the entries but that the exception is being ignored by lxml at the wrong point. Python C-API exception handling requires code to check for exceptions at certain points; if that is not done, the exception crops up later, when another exception occurs that is handled properly. What happens if you omit the first test call? Does the same XMLSyntaxError occur?

Martijn Pieters Over a year ago

This should certainly be reported to the LXML project in any case.

gatto Over a year ago

@Martijn Pieters: yes, the same error occurs, the first test call was only to show that XMLSyntaxError message changes after parsing e.

Martijn Pieters Over a year ago

Thinking about it again, the error still reflects a previous error that has not been handled; this should certainly be reported to the devs.

gatto Over a year ago

I've found the bug in their bug tracker.

4

You could create a parser yourself, instead of using document_fromstring:

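from cStringIO import StringIO

import feedparser
from lxml import etree

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Encode to bytes so lxml does not reject a Unicode string
    # that carries an encoding declaration.
    text = entry.content[0]['value'].encode('utf8')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    # Join every text node in the parsed tree.
    print ''.join(tree.xpath('.//text()'))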

For Blogger.com Atom feed exports, this works to print the text content of the .content[0].value entry.
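
The snippet above is Python 2 (cStringIO and the print statement do not exist in Python 3). A rough Python 3 equivalent might look like the sketch below. The helper name html_text is hypothetical, the input is assumed to be UTF-8 once encoded, and the early return for empty input is an added assumption so that lxml is never handed an empty document.

```python
from io import BytesIO

from lxml import etree


def html_text(markup):
    """Return the concatenated text nodes of an HTML string.

    `html_text` is a hypothetical helper name, not part of lxml.
    """
    data = markup.encode("utf-8") if isinstance(markup, str) else markup
    if not data.strip():
        # Nothing to parse; avoid handing lxml an empty document.
        return ""
    # Tell the parser the bytes are UTF-8 instead of letting it guess.
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse(BytesIO(data), parser)
    # Join every text node found in the parsed tree.
    return "".join(tree.xpath(".//text()"))
```

Called inside the loop over d.entries, html_text(entry.content[0]['value']) would take the place of the encode, parse, and xpath lines.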

edited Apr 9, 2013 at 14:40

answered Apr 7, 2013 at 11:50

Martijn Pieters

3 Comments

showkey Over a year ago

1. Add from lxml import etree. 2. Maybe it should be print tree.text_content(). 3. But that produces an error: Traceback (most recent call last): File "<stdin>", line 5, in <module> AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'

showkey Over a year ago

Traceback (most recent call last): File "<stdin>", line 5, in <module> AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'. There is still a problem.

Martijn Pieters Over a year ago

@it_is_a_literature: my apologies, indeed that method does not exist.
