A server I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR gets encoded as UTF-8 (E2 98 86) and then translated byte-by-byte through a Latin-1 to HTML entity table. What I get is &acirc;\x98\x86 (9 bytes: the 7-byte entity plus the two raw continuation bytes) in a file that's declared as UTF-8 and has no DTD.
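To illustrate, here is my guess at the mangling in Python 2 (the byte-wise entity table is an assumption about the server, not something I can verify):

star = u'\u2606'.encode('utf-8')          # '\xe2\x98\x86' -- UTF-8 WHITE STAR
# 0xE2 is 'â' in Latin-1, so a byte-wise entity table turns it into '&acirc;';
# 0x98 and 0x86 have no named entity and pass through untouched:
broken = star.replace('\xe2', '&acirc;')
assert broken == '&acirc;\x98\x86' and len(broken) == 9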

I couldn't configure W3C Tidy in a way that doesn't garble this irreversibly. With lxml, I only found out how to make it skip the bad span silently (sketch below). Python's SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.
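"Skip it silently" with lxml looks like this, for reference (recover=True is a real lxml option; it parses past the fatal error, but the bad span is simply dropped):

from lxml import etree

parser = etree.XMLParser(recover=True)      # keep going after the fatal error
tree = etree.fromstring(xml_bytes, parser)  # xml_bytes: the fetched document; the star's text is lost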

What else is there?

5 Comments

  • Not clear: are you saying that the server is sending the XML header <?xml version="1.0" encoding="UTF-8"?>, with the XML somewhere containing &acirc;\x98\x86? Commented Aug 26, 2010 at 18:55
  • Exactly. I don't know at what point the server encodes the entities, so I'm reluctant to just reverse it before even calling a parser. Commented Aug 26, 2010 at 19:48
  • The lxml.html parser (and probably Beautiful Soup) can PARSE that broken XML, but they can't fix it so that you get a Unicode WHITE STAR out (and I don't think you can fix it with a SAX entity handler either). You'll probably have to fix the byte stream using re.sub and htmlentitydefs before passing it to the parser. (I wonder what sort of process writes such broken output? One part of it must think it's writing Latin-1 HTML while another thinks it's producing UTF-8 XML!) Commented Aug 26, 2010 at 20:59
  • It's an easy mistake to make in a language without native Unicode support. AFAIK the server is written in PHP... Commented Aug 26, 2010 at 22:27
  • If the XML is not well formed, i.e. broken, then get whoever is generating it to generate it correctly. Similarly, if it does not conform to the DTD or schema it is supposed to, return it to sender. Commented Jan 22, 2013 at 10:46

2 Answers


BeautifulSoup is your best bet in this case. I suggest profiling before ruling it out altogether.
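A sketch of that route with BeautifulSoup 3 (the version current when this was written); BeautifulStoneSoup is its lenient XML parser and convertEntities is a real option. Note that it resolves &acirc; to the character 'â' rather than restoring the original byte, so the star itself still won't come back:

from BeautifulSoup import BeautifulStoneSoup   # BeautifulSoup 3.x

# Lenient parse: undefined entities and stray bytes don't abort parsing
soup = BeautifulStoneSoup(badxml, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)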


3 Comments

"[...] you don't really care what HTML is supposed to look like. Neither does this parser. " :-)
I did, and it's orders of magnitude slower than the lxml.objectify I'm using now (accepting a few broken strings in the UI).
@Tobias could you post some actual results and version numbers? Would be useful for reference to others. Yes, I know this is an old question - just in case. :) (A minimal timing sketch follows these comments.)
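For what it's worth, a harness along these lines could produce such numbers (the sample file name and iteration count here are made up):

import timeit

setup = """
from lxml import objectify
from BeautifulSoup import BeautifulStoneSoup
data = open('sample.xml').read()   # hypothetical sample document
"""
print timeit.timeit('objectify.fromstring(data)', setup=setup, number=100)
print timeit.timeit('BeautifulStoneSoup(data)', setup=setup, number=100)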

Maybe something like:

import re
import htmlentitydefs as ents   # Python 2; renamed html.entities in Python 3
from lxml import etree          # or lxml.html, if the input is still more broken

def repl_ent(m):
    # '&acirc;' -> '\xe2': entitydefs maps names back to Latin-1 bytes,
    # which here restores the original UTF-8 sequence
    return ents.entitydefs[m.group()[1:-1]]

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)

3 Comments

You need to remove the five XML entities from htmlentitydefs to avoid unescaping <> (see the sketch after these comments).
As I said I'm reluctant to do this because it looks like the server only entity-encodes the contents of one specific tag.
The problem is that I don't think you can do it from SAX or a SAX filter, so you would have to drop down to the XMLReader interface, where you would have to do something similar to the above. (The Java parser API has an optional feature that tells it to try to continue after a fatal error, so it might be possible to fix it and continue, but I don't know if that can be done in Python. If it can, it's probably a more complicated procedure than the above. Are there any hooks in lxml that can do this?)
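Regarding the first comment above, a sketch of that guard, reusing re, ents and badxml from the answer (the skip set is the five XML-predefined names):

XML_ENTS = frozenset(['lt', 'gt', 'amp', 'quot', 'apos'])

def repl_ent_safe(m):
    name = m.group()[1:-1]
    if name in XML_ENTS:
        return m.group()                         # keep markup escapes escaped
    return ents.entitydefs.get(name, m.group())  # unknown names pass through unchanged

goodxml = re.sub(r'&\w+;', repl_ent_safe, badxml)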
