23

I have a set of very simple XML files to parse, but they use custom-defined entities. I don't need to map these to characters, but I do want to parse and act on each one. For example:

<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>

There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods it mentions, and everything I try gives an error:

    #!/usr/bin/python
    ##
    ## Where's the entity support as documented at:
    ## http://effbot.org/elementtree/elementtree-xmlparser.htm
    ## In Python 2.7.1+ ?
    ##
    from pprint     import pprint
    from xml.etree  import ElementTree
    from cStringIO  import StringIO

    parser = ElementTree.ElementTree()
   #parser.entity["maxscale_zoom11"] = unichr(160)
    testf = StringIO('<foo>&maxscale_zoom11;</foo>')
    tree = parser.parse(testf)
   #tree = parser.parse(testf,"XMLParser")
    for node in tree.iter('foo'):
        print node.text

Depending on which lines you comment out, this gives:

xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5

or

AttributeError: 'ElementTree' object has no attribute 'entity'

or

AttributeError: 'str' object has no attribute 'feed'           

For those curious, the XML is from OpenStreetMap's Mapnik project.

3 Comments
  • Possibly related question: stackoverflow.com/questions/2524299/entity-references-and-lxml Commented Aug 30, 2011 at 1:11
  • Not related, because in that case the entity is actually defined. Remove the entity definition and you're back to my question. Commented Aug 30, 2011 at 6:21
  • fyi - someone may want to fix the /usr/bin/python to /usr/bin/env python as the shebang line is wrong for most systems. Commented Nov 12, 2012 at 2:41

2 Answers

16

As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.

I finally got it working. Quoted from this Q&A.

Inspired by this post, we can prepend a DOCTYPE declaration that defines the missing entities to the incoming raw HTML content, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)

2 Comments

This only applies to HTML documents though, right? The notion of 'DOCTYPE' processing instructions does not apply to "simple XML files" as the OP is apparently dealing with.
@FrerichRaabe, Sorry, admittedly I did not test it on XML docs. That answer was quoted from here, and I was hoping it would be helpful. That original Q&A link contains another answer that may or may not help in your situation.
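
For what it's worth, the same prepend trick seems to carry over to plain XML, since a DOCTYPE with an internal subset is ordinary XML syntax. Below is a minimal sketch using the document from the question. The helper name parse_with_entities and the placeholder value "{maxscale_zoom11}" are made up for illustration, and it assumes the input has no XML declaration or DOCTYPE of its own and that the replacement values contain no quotes, "%", "&" or "<".

```python
import xml.etree.ElementTree as ET

# The sample document from the question, with its undefined entity.
xml_body = '''<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>'''


def parse_with_entities(xml_text, root_tag, entities):
    """Hypothetical helper: declare each entity in an internal DTD subset,
    prepend it to the document, and parse the result.

    Assumes xml_text has no XML declaration or DOCTYPE of its own.
    """
    decls = "".join(
        '<!ENTITY %s "%s">' % (name, value) for name, value in entities.items()
    )
    doctype = "<!DOCTYPE %s [%s]>" % (root_tag, decls)
    return ET.fromstring(doctype + xml_text)


# Map the entity to a recognisable placeholder instead of a real character.
style = parse_with_entities(
    xml_body, "Style", {"maxscale_zoom11": "{maxscale_zoom11}"}
)

# The expanded entity ends up in the text following <Filter>, i.e. its tail.
filter_tail = style.find("Rule/Filter").tail
print(repr(filter_tail))
```

The catch is that every entity still has to be declared up front, so this only helps when the entity names are known (or can be collected) before parsing.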
14

I'm not sure if this is a bug in ElementTree or what, but you need to call UseForeignDTD(True) on the expat parser to make it behave the way it did in the past.

It's a bit hacky, but you can do this by creating your own instance of ElementTree.XMLParser, calling the method on its underlying expat parser (parser.parser), and then passing the XMLParser to ElementTree.parse():

from xml.etree  import ElementTree
from cStringIO  import StringIO


testf = StringIO('<foo>&moo_1;</foo>')

parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity['moo_1'] = 'MOOOOO'

etree = ElementTree.ElementTree()

tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text

This outputs "MOOOOO"

Or using a mapping interface:

from xml.etree  import ElementTree
from cStringIO  import StringIO

class AllEntities:
    def __getitem__(self, key):
        #key is your entity, you can do whatever you want with it here
        return key

testf = StringIO('<foo>&moo_1;</foo>')

parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = AllEntities()

etree = ElementTree.ElementTree()

tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text

This outputs "moo_1"

A more complex fix would be to subclass ElementTree.XMLParser and fix it there.

8 Comments

A bit icky, as you say, but thanks. Is there any way to avoid having to predefine the entities (e.g. &moo_2;)?
@Bryce: being predefined is the point of entities, no? Nevertheless: you could set parser.entity to your own dictionary-like object. As a simple example, you could do parser.entity = collections.defaultdict(str) to have all undefined entities replaced by an empty string.
This won't work in Python 3 with CPython, where the C versions (formerly cElementTree) are used instead.
I'm not sure if this is possible at all in Python 3 currently. Looking at the docs, I see the method signature xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None), but the docs say: "Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation." It looks like ElementTree is getting stricter: if your entities aren't defined, the document is not valid and won't be parsed.
I had this working in 2.7 in an overridden ElementTree XMLParser, but I can no longer extend that in 3.5 as you point out, because inheriting from the cElementTree parser is not possible. Not sure if I should be pushing my content through a custom codec before parsing, or what. Is there a standard answer for python 3?
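
One route that avoids the parser internals altogether, and so should behave the same on Python 2.7 and 3, is to rewrite the entity references before parsing. The sketch below turns each custom &name; into a marker element that can be walked like any other node. The function name entities_to_elements and the marker tag "entity" are invented for this example, and it assumes the references only occur in element content, not inside attribute values, comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# The five entities every XML parser already understands; leave them alone.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Named references only; numeric ones such as &#160; do not match.
ENTITY_RE = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")


def entities_to_elements(xml_text, tag="entity"):
    """Hypothetical helper: replace each custom &name; with <entity name="name"/>.

    Assumes entity references appear only in element content.
    """
    def replace(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return '<%s name="%s"/>' % (tag, name)

    return ENTITY_RE.sub(replace, xml_text)


source = '''<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>'''

root = ET.fromstring(entities_to_elements(source))

# Each former entity reference is now an element that can be acted on.
names = [element.get("name") for element in root.iter("entity")]
print(names)
```

Unlike the parser.entity approach, nothing has to be predefined here; the trade-off is that the regex knows nothing about XML structure, so it is only safe for simple files like the one in the question.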