I am working with some Python code that uses the lxml HTML parser to parse the HTML that a co-worker scraped from a random sample of web sites.

In two of them, I get an error of the form

"'utf8' codec can't decode bytes 0xe2 0x80 in position 502: unexpected end of data",

and the HTML content does contain a corrupt UTF-8 character.

A variable in the code called ele is assigned to a <p> element surrounding the text with the bad character, and that text can be accessed via ele.text. Or it could be, but merely assigning ele.text to another variable causes a UnicodeDecodeError to be raised. The UnicodeDecodeError object available in the except clause contains some useful attributes, such as the start and end positions of the bad bytes in the text, which could be used to create a new string with the bad bytes removed; but doing anything to ele.text, such as taking a substring of it, causes a new UnicodeDecodeError to be raised. Is there anything I can do to salvage the good parts of ele.text?

I am writing this from memory, and I don't remember all the details of the code, so I can supply more information tomorrow if it's useful. What I remember is that ele is an object of a type something like lxml._Element, the file being parsed really is in utf-8, and there is a place in the file where the first two utf-8 bytes of the character that matches the entity &rdquo; are followed by the entity &rdquo;. So the text contains "\xE2\x80&rdquo;". The error message complains about the "\xE2\x80" bytes and gives their position in a string of about 520 characters. I could discard the whole string if necessary, but I'd rather just use the position info to discard the "\xE2\x80". For some reason, doing anything with ele.text causes an error in lower-level Cython code in lxml. I can provide the stack trace tomorrow when I am at work. What, if anything, can I do with that text? Thanks.
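(For reference, the salvage operation described above can be sketched with plain bytes, assuming the raw byte string is still accessible before lxml gets to it; the byte string here is a made-up stand-in for the real document, Python 3 syntax:)

```python
raw = b"before \xe2\x80&rdquo; after"  # stand-in for the broken text

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start and err.end are the offsets of the undecodable bytes,
    # so cutting them out and decoding again salvages the rest
    text = (raw[:err.start] + raw[err.end:]).decode("utf-8")

print(text)
# -> before &rdquo; after
```

This assumes a single bad run; with several, either loop until the decode succeeds or simply use raw.decode("utf-8", "ignore").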

1 Answer

e2 80 bytes by themselves do not cause the error:

from lxml import html

html_data = b"<p>before &ldquo;\xe2\x80&rdquo; after"
p = html.fromstring(html_data)
print(repr(p.text))
# -> u'before \u201c\xe2\x80\u201d after'

As @Esailija pointed out in the comments, the above doesn't interpret the data as utf-8. To force utf-8 encoding:

from lxml import html

html_data = b"""<meta http-equiv="content-type"
                      content="text/html; charset=UTF-8">
                <p>before &ldquo;\xe2\x80&rdquo; after"""
doc = html.fromstring(html_data.decode('utf-8','ignore'))
print(repr(doc.find('.//p').text))
# -> u'before \u201c\u201d after'

In general, either:
  • check that utf-8 is the correct character encoding for the document, or
  • replace the broken byte sequence before passing it to lxml.
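A sketch of the second option using a custom error handler (the handler name "drop" is made up for this example; errors="ignore" does the same thing, but the handler shows how the exception's end offset drives where decoding resumes):

```python
import codecs

def drop_bad_bytes(err):
    # return replacement text ("") and the offset at which to resume decoding
    return ("", err.end)

codecs.register_error("drop", drop_bad_bytes)

html_data = b"<p>before &ldquo;\xe2\x80&rdquo; after</p>"
cleaned = html_data.decode("utf-8", errors="drop")
print(cleaned)  # the two stray bytes are gone; feed this to html.fromstring()
```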

8 Comments

Yes they do; e2 is the lead byte of a 3-byte UTF-8 sequence. You can cause the error just by writing "\xe2\x80".decode("utf-8")
@Esailija: what do you get if you run the code from the answer?
There is no decoding attempt (as utf-8) in your answer at all.
@Esailija: try type(p.text). What do you see?
I see the result of effective "\xe2\x80".decode("iso-8859-1")
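The disagreement in these comments can be reproduced directly (Python 3 spelling; in Python 2 the b prefixes are unnecessary):

```python
# As UTF-8, the two bytes are an incomplete 3-byte sequence, so decoding fails:
try:
    b"\xe2\x80".decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # -> unexpected end of data

# As ISO-8859-1, every byte maps to a code point, so decoding always succeeds:
print(b"\xe2\x80".decode("iso-8859-1"))  # two characters, U+00E2 and U+0080
```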