A server I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR gets encoded as UTF-8 (E2 98 86) and then translated byte-by-byte through a Latin-1 to HTML entity table. What I get is &acirc;\x98\x86 (9 bytes: the 7-byte entity plus the two raw continuation bytes) in a file that's declared as UTF-8 and has no DTD.
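To illustrate, here is my guess at the mangling in Python 2 (the byte-wise entity table is an assumption about the server, not something I can verify):

star = u'\u2606'.encode('utf-8')          # '\xe2\x98\x86' -- UTF-8 WHITE STAR
# 0xE2 is 'â' in Latin-1, so a byte-wise entity table turns it into '&acirc;';
# 0x98 and 0x86 have no named entity and pass through untouched:
broken = star.replace('\xe2', '&acirc;')
assert broken == '&acirc;\x98\x86' and len(broken) == 9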

I couldn't configure W3C Tidy in a way that doesn't garble this irreversibly. With lxml, I only found out how to make it skip the bad span silently (sketch below). Python's SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.
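"Skip it silently" with lxml looks like this, for reference (recover=True is a real lxml option; it parses past the fatal error, but the bad span is simply dropped):

from lxml import etree

parser = etree.XMLParser(recover=True)      # keep going after the fatal error
tree = etree.fromstring(xml_bytes, parser)  # xml_bytes: the fetched document; the star's text is lost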

What else is there?

5 Comments

  • Not clear: are you saying that the server is sending the XML header <?xml version="1.0" encoding="UTF-8"?>, with the XML somewhere containing &acirc;\x98\x86? Commented Aug 26, 2010 at 18:55
  • Exactly. I don't know at what point the server encodes the entities, so I'm reluctant to just reverse it before even calling a parser. Commented Aug 26, 2010 at 19:48
  • The lxml.html parser (and probably Beautiful Soup) can PARSE that broken XML, but they can't fix it so that you get a Unicode WHITE STAR out (and I don't think you can fix it with a SAX entity handler either). You'll probably have to fix the byte stream using re.sub and htmlentitydefs before passing it to the parser. (I wonder what sort of process writes such broken output? One part of it must think it's writing Latin-1 HTML while another thinks it's producing UTF-8 XML!) Commented Aug 26, 2010 at 20:59
  • It's an easy mistake to make in a language without native Unicode support. AFAIK the server is written in PHP... Commented Aug 26, 2010 at 22:27
  • If the XML is not well formed, i.e. broken, then get whoever is generating it to generate it correctly. Similarly, if it does not conform to the DTD or schema it is supposed to, return it to sender. Commented Jan 22, 2013 at 10:46

2 Answers


BeautifulSoup is your best bet in this case. I suggest profiling before ruling it out altogether.
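A sketch of that route with BeautifulSoup 3 (the version current when this was written); BeautifulStoneSoup is its lenient XML parser and convertEntities is a real option. Note that it resolves &acirc; to the character 'â' rather than restoring the original byte, so the star itself still won't come back:

from BeautifulSoup import BeautifulStoneSoup   # BeautifulSoup 3.x

# Lenient parse: undefined entities and stray bytes don't abort parsing
soup = BeautifulStoneSoup(badxml, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)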


3 Comments

"[...] you don't really care what HTML is supposed to look like. Neither does this parser. " :-)
I did, and it's orders of magnitude slower than the lxml.objectify I'm using now (accepting a few broken strings in the UI).
@Tobias could you post some actual results and version numbers? Would be useful for reference to others. Yes, I know this is an old question - just in case. :) (A minimal timing sketch follows these comments.)
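For what it's worth, a harness along these lines could produce such numbers (the sample file name and iteration count here are made up):

import timeit

setup = """
from lxml import objectify
from BeautifulSoup import BeautifulStoneSoup
data = open('sample.xml').read()   # hypothetical sample document
"""
print timeit.timeit('objectify.fromstring(data)', setup=setup, number=100)
print timeit.timeit('BeautifulStoneSoup(data)', setup=setup, number=100)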

Maybe something like:

import re
import htmlentitydefs as ents   # Python 2; renamed html.entities in Python 3
from lxml import etree          # or lxml.html, if the input is still more broken

def repl_ent(m):
    # '&acirc;' -> '\xe2': entitydefs maps names back to Latin-1 bytes,
    # which here restores the original UTF-8 sequence
    return ents.entitydefs[m.group()[1:-1]]

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)

3 Comments

You need to remove the five XML entities from htmlentitydefs to avoid unescaping <> (see the sketch after these comments).
As I said I'm reluctant to do this because it looks like the server only entity-encodes the contents of one specific tag.
The problem is that I don't think you can do it from SAX or a SAX filter, so you would have to drop down to the XMLReader interface, where you would have to do something similar to the above. (The Java parser API has an optional feature that tells it to try to continue after a fatal error, so it might be possible to fix it and continue, but I don't know if that can be done in Python. If it can, it's probably a more complicated procedure than the above. Are there any hooks in lxml that can do this?)
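Regarding the first comment above, a sketch of that guard, reusing re, ents and badxml from the answer (the skip set is the five XML-predefined names):

XML_ENTS = frozenset(['lt', 'gt', 'amp', 'quot', 'apos'])

def repl_ent_safe(m):
    name = m.group()[1:-1]
    if name in XML_ENTS:
        return m.group()                         # keep markup escapes escaped
    return ents.entitydefs.get(name, m.group())  # unknown names pass through unchanged

goodxml = re.sub(r'&\w+;', repl_ent_safe, badxml)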
