Python xml.etree.ElementTree, getting HTML entities

Question

I am trying to analyze XML data and ran into an issue with HTML entities when I use the following code:

import xml.etree.ElementTree as ET

tree = ET.parse(my_xml_file)  # my_xml_file: path to the XML file shown below
root = tree.getroot()
for regex_rule in root.findall('.//regex_rule'):
    # This .get() method turns &lt; into <, but I want to get &lt; as written
    print(regex_rule.get('input'))
    # Prints False because ElementTree's get method turns &lt; into <, is that right?
    print(regex_rule.get('input') == r"(?&lt;!\S)hello(?!\S)")

And here are the XML file contents:

<rules>
<regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/>
</rules>

I would appreciate it if anybody could direct me to a way of getting the string as is from the XML attribute for the input, without converting &lt; into <.

atomicinf · Accepted Answer · 2013-10-24 04:12:07Z


xml.etree.ElementTree is doing exactly the standards-compliant thing, which is to decode XML character entities with the understanding that they do in fact encode the referenced character and should be interpreted as such.
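To make the decoding concrete, here is a minimal sketch that parses the XML from the question out of an inline string (rather than a file, which is an assumption made only to keep the example self-contained) and shows what the attribute value looks like after parsing:

```python
import xml.etree.ElementTree as ET

# XML from the question, inlined as a raw string so the backslashes survive.
XML_TEXT = r'''<rules>
<regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/>
</rules>'''

root = ET.fromstring(XML_TEXT)
rule = root.find('.//regex_rule')

# The parser has already replaced the &lt; reference with the character "<".
decoded = rule.get('input')

print(decoded)                            # (?<!\S)hello(?!\S)
print(decoded == r"(?<!\S)hello(?!\S)")   # True
print("&lt;" in decoded)                  # False
```

The decoded value is the regex the file actually describes, so comparing against the decoded form is usually what you want.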

The preferred course of action, if you do need to encode the literal text &lt;, is to change your input file to use &amp;lt; instead (i.e. XML-encode the &).
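As a rough illustration of XML-encoding the ampersand, the sketch below assumes the attribute in the input has been rewritten with &amp;lt; (the inline string stands in for the edited file). After parsing, the value contains the four literal characters &lt; because only the &amp; reference is decoded:

```python
import xml.etree.ElementTree as ET

# Same rule as in the question, but with the ampersand itself XML-encoded.
XML_TEXT = r'<rules><regex_rule input="(?&amp;lt;!\S)hello(?!\S)" output="world"/></rules>'

root = ET.fromstring(XML_TEXT)
value = root.find('regex_rule').get('input')

# &amp; decodes to "&", so the text "&lt;" comes through literally.
print(value)                               # (?&lt;!\S)hello(?!\S)
print(value == r"(?&lt;!\S)hello(?!\S)")   # True
```

Note that a file written this way no longer describes a regex containing <; it describes one containing the text &lt;.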

If you can't change your input file format then you'll probably need to use a different module, or write your own parser: xml.etree.ElementTree translates entities well before you can do anything meaningful with the output.
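If installing another module is not an option, one possible standard-library workaround (not part of the advice above, just a sketch) is to leave the parser alone and normalize at comparison time with xml.sax.saxutils. Either decode the string you are comparing against, or re-escape the parsed value. Re-escaping is only an approximation: it cannot tell whether the file originally said &lt; or &#60;, and it will also turn a literal > into &gt;.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

XML_TEXT = r'<rules><regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/></rules>'
decoded = ET.fromstring(XML_TEXT).find('regex_rule').get('input')

# The string as it is written in the XML source.
expected_as_written = r"(?&lt;!\S)hello(?!\S)"

# Option A: decode the expected string and compare decoded forms.
matches_decoded = unescape(expected_as_written) == decoded

# Option B: re-escape the parsed value (approximate, see the caveats above).
reescaped = escape(decoded)
matches_escaped = reescaped == expected_as_written

print(matches_decoded)  # True
print(reescaped)        # (?&lt;!\S)hello(?!\S)
print(matches_escaped)  # True
```

Option A is probably the safer of the two when the same rules also live in JSON files, since the decoded form is what those files would contain.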


1 Comment

GiantEnemyCrab Over a year ago

Thanks for your input. It seems that I am out of luck, using xml.etree.ElementTree. I will resort to some kind of other creative solutions. (I am in an environment where I can't easily install other modules like lxml, etc). I am basically checking rules that exist in xml and json files. In json files, there is no html entity, and there shouldn't be. I will accept your response as the answer. Thank you.
