I am working with some Python code that uses the lxml HTML parser to parse the HTML that a co-worker scraped from a random sample of web sites.

In two of them, I get an error of the form

"'utf8' codec can't decode bytes 0xe2 0x80 in position 502: unexpected end of data",

and the HTML content does contain a corrupt UTF-8 character.

A variable in the code called ele is assigned to a <p> element surrounding the text with the bad character, and that text can be accessed via ele.text. Or it could be, but merely assigning ele.text to another variable causes a UnicodeDecodeError to be raised. The UnicodeDecodeError object available in the except clause contains some useful attributes, such as the start and end positions of the bad bytes in the text, which could be used to create a new string with the bad bytes removed; but doing anything to ele.text, such as taking a substring of it, causes a new UnicodeDecodeError to be raised. Is there anything I can do to salvage the good parts of ele.text?

I am writing this from memory, and I don't remember all the details of the code, so I can supply more information tomorrow if it's useful. What I remember is that ele is an object of a type something like lxml._Element, the file being parsed really is in utf-8, and there is a place in the file where the first two utf-8 bytes of the character that matches the entity &rdquo; are followed by the entity &rdquo;. So the text contains "\xE2\x80&rdquo;". The error message complains about the "\xE2\x80" bytes and gives their position in a string of about 520 characters. I could discard the whole string if necessary, but I'd rather just use the position info to discard the "\xE2\x80". For some reason, doing anything with ele.text causes an error in lower-level Cython code in lxml. I can provide the stack trace tomorrow when I am at work. What, if anything, can I do with that text? Thanks.
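(For reference, the salvage operation described above can be sketched with plain bytes, assuming the raw byte string is still accessible before lxml gets to it; the byte string here is a made-up stand-in for the real document, Python 3 syntax:)

```python
raw = b"before \xe2\x80&rdquo; after"  # stand-in for the broken text

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start and err.end are the offsets of the undecodable bytes,
    # so cutting them out and decoding again salvages the rest
    text = (raw[:err.start] + raw[err.end:]).decode("utf-8")

print(text)
# -> before &rdquo; after
```

This assumes a single bad run; with several, either loop until the decode succeeds or simply use raw.decode("utf-8", "ignore").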

1 Answer

e2 80 bytes by themselves do not cause the error:

from lxml import html

html_data = b"<p>before &ldquo;\xe2\x80&rdquo; after"
p = html.fromstring(html_data)
print(repr(p.text))
# -> u'before \u201c\xe2\x80\u201d after'

As @Esailija pointed out in the comments, the above doesn't interpret the data as utf-8. To force utf-8 encoding:

from lxml import html

html_data = b"""<meta http-equiv="content-type"
                      content="text/html; charset=UTF-8">
                <p>before &ldquo;\xe2\x80&rdquo; after"""
doc = html.fromstring(html_data.decode('utf-8','ignore'))
print(repr(doc.find('.//p').text))
# -> u'before \u201c\u201d after'

In general, either:
  • check that utf-8 is the correct character encoding for the document, or
  • replace the broken byte sequence before passing it to lxml.
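A sketch of the second option using a custom error handler (the handler name "drop" is made up for this example; errors="ignore" does the same thing, but the handler shows how the exception's end offset drives where decoding resumes):

```python
import codecs

def drop_bad_bytes(err):
    # return replacement text ("") and the offset at which to resume decoding
    return ("", err.end)

codecs.register_error("drop", drop_bad_bytes)

html_data = b"<p>before &ldquo;\xe2\x80&rdquo; after</p>"
cleaned = html_data.decode("utf-8", errors="drop")
print(cleaned)  # the two stray bytes are gone; feed this to html.fromstring()
```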

8 Comments

Yes they do; e2 is the lead byte of a 3-byte UTF-8 sequence. You can cause the error just by writing "\xe2\x80".decode("utf-8")
@Esailija: what do you get if you run the code from the answer?
There is no decoding attempt (as utf-8) in your answer at all.
@Esailija: try type(p.text). What do you see?
I see the result of effective "\xe2\x80".decode("iso-8859-1")
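The disagreement in these comments can be reproduced directly (Python 3 spelling; in Python 2 the b prefixes are unnecessary):

```python
# As UTF-8, the two bytes are an incomplete 3-byte sequence, so decoding fails:
try:
    b"\xe2\x80".decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # -> unexpected end of data

# As ISO-8859-1, every byte maps to a code point, so decoding always succeeds:
print(b"\xe2\x80".decode("iso-8859-1"))  # two characters, U+00E2 and U+0080
```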