5

How to check if two XML files are equivalent?

For example, the two XML files below should count as the same even though the ordering is different. I need to check whether the two XML files contain the same textual info, disregarding the order.

<a>
   <b>hello</b>
   <c><d>world</d></c>
</a>

<a>
   <c><d>world</d></c>
   <b>hello</b>
</a>

Are there tools for this out there?

3
  • 2
    Actually, they are not the same, as XML usually also contains the order of elements. So if you want to define that as “the same”, you will probably have to write your own comparison function. Commented Oct 20, 2010 at 13:01
  • Well, those files may be semantically equivalent - or they may not be. Are you certain that in your situation ordering isn't important? It's important in plenty of XML files. Commented Oct 20, 2010 at 13:01
  • 1
    @poke and @Jon : Thanks for the comment, I changed the title from 'the same' to 'equivalent'. Commented Oct 20, 2010 at 13:03

3 Answers

11

It all depends on your definition of "equivalent".

Assuming you really only care about the text nodes (for example: the d tags in your example do not even matter, you only care about the content word), you can just make a set of the text nodes of each document, and compare the sets. Using lxml, this could look like:

from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

print(set(tree1.getroot().itertext()) == set(tree2.getroot().itertext()))

You might even want to ignore whitespace nodes, doing something like:

set(i for i in tree.getroot().itertext() if i.strip())

Note that using sets means you will NOT take into account how many times a given piece of text occurs in the document (this might be what you want, it might not). If the order is not important but the number of occurrences is, you could use a dictionary instead of a set and keep track of the counts (e.g. with collections.defaultdict, or with collections.Counter, available since Python 2.7).
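As a minimal sketch of the counting variant, here is one way it could look. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the helper name text_counts and the inline sample documents are made up for illustration. It also strips surrounding whitespace from each text node, which is an assumption about what "the same text" should mean.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def text_counts(xml_string):
    """Count each non-whitespace text node, ignoring document order."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text and tail string in the tree.
    return Counter(t.strip() for t in root.itertext() if t.strip())


# Hypothetical sample documents, based on the example in the question.
doc1 = "<a><b>hello</b><c><d>world</d></c></a>"
doc2 = "<a><c><d>world</d></c><b>hello</b></a>"
doc3 = "<a><b>hello</b><b>hello</b><c><d>world</d></c></a>"

print(text_counts(doc1) == text_counts(doc2))  # True: same texts, same counts
print(text_counts(doc1) == text_counts(doc3))  # False: "hello" occurs twice in doc3
```

With plain sets, doc1 and doc3 would compare equal, which is exactly the difference described above.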

But if it is only the order of the direct child elements of the root element (in your case, the children of the a element) that may be ignored, and everything inside those elements really counts, you need another approach. You could, for example, apply XML canonicalization (C14N) to each child element to get a normalized version of each child (again, I don't know if this is normalized enough for your needs).

from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

set1 = set(etree.tostring(i, method='c14n') for i in tree1.getroot())
set2 = set(etree.tostring(i, method='c14n') for i in tree2.getroot())

print(set1 == set2)

Note: to keep the example simple, I've used the development version of lxml. In older versions there is no method='c14n' for etree.tostring(), only a write_c14n() method on the ElementTree, which writes to a file-like object. So to get it working there, you'd have to copy each element to a tree of its own and use a StringIO() object as a dummy file.

Also, this way of doing it is probably not recommended with very large files.
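If lxml is not available, roughly the same idea can be sketched with the standard library, assuming Python 3.8 or later, where xml.etree.ElementTree.canonicalize() exists. The helper name canonical_children and the sample documents are hypothetical, and this version counts duplicates with a Counter rather than collapsing them into a set. The strip_text=True option, which drops whitespace around text content, is a choice made here and may or may not match your definition of "equivalent".

```python
import copy
from collections import Counter
import xml.etree.ElementTree as ET


def canonical_children(xml_string):
    """Return a multiset of the canonical form of each direct child of the root."""
    root = ET.fromstring(xml_string)
    forms = Counter()
    for child in root:
        child = copy.deepcopy(child)
        child.tail = None  # ignore text that follows the element inside its parent
        serialized = ET.tostring(child, encoding="unicode")
        # strip_text=True removes leading/trailing whitespace in text content.
        forms[ET.canonicalize(serialized, strip_text=True)] += 1
    return forms


# Hypothetical sample documents: same children, different order and indentation.
doc1 = "<a>\n  <b>hello</b>\n  <c><d>world</d></c>\n</a>"
doc2 = "<a><c><d>world</d></c><b>hello</b></a>"
# Here the order differs one level deeper, inside <c>.
doc3 = "<a><b>hello</b><c><e/><d>world</d></c></a>"
doc4 = "<a><b>hello</b><c><d>world</d><e/></c></a>"

print(canonical_children(doc1) == canonical_children(doc2))
print(canonical_children(doc3) == canonical_children(doc4))
```

The first comparison ignores the order of the root's children; the second still treats order inside each child as significant, which matches the approach described above.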

But again: a BIG WARNING: you really have to know what you need as "equivalent", and create your own solution based on that knowledge!

1

Ordering is important in XML, so the two files you provided are different. Normally you could normalize the XML and then simply compare the files as text, but if you want order-insensitive comparison, you will probably have to implement it yourself using one of the bazillion XML parsers out there (I would recommend lxml, by the way).
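One possible shape for such a hand-written comparison is sketched below, using the standard library parser rather than lxml so it runs without extra packages. The function names canonical_key and equivalent_unordered are made up for illustration. The sketch assumes that child order is irrelevant at every level, that surrounding whitespace in text is insignificant, and that tail text can be ignored; adjust those choices to your own definition of equivalence.

```python
import xml.etree.ElementTree as ET


def canonical_key(elem):
    """Build a sortable key for an element that does not depend on child order."""
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    # Sorting the children's keys removes any dependence on document order.
    children = tuple(sorted(canonical_key(child) for child in elem))
    # Tail text is deliberately ignored in this sketch.
    return (elem.tag, attrs, text, children)


def equivalent_unordered(xml1, xml2):
    """Compare two XML strings, ignoring the order of sibling elements."""
    return canonical_key(ET.fromstring(xml1)) == canonical_key(ET.fromstring(xml2))


# Hypothetical sample documents, based on the example in the question.
doc1 = "<a>\n   <b>hello</b>\n   <c><d>world</d></c>\n</a>"
doc2 = "<a>\n   <c><d>world</d></c>\n   <b>hello</b>\n</a>"
doc3 = "<a>\n   <c><d>world!</d></c>\n   <b>hello</b>\n</a>"

print(equivalent_unordered(doc1, doc2))  # True: only the sibling order differs
print(equivalent_unordered(doc1, doc3))  # False: the text of <d> differs
```

Because the key is built recursively, the whole tree is held in memory, so this is probably not a good fit for very large files.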

3 Comments

How to normalize the XML files? Thanks.
The idea of normalization is to remove insignificant whitespace in XML. There are plenty of external utilities to do this, possibly library support as well. Google it.
@prosseek: and in your case, you probably also want to reorder the child elements of <a> in alphabetical order (or some other consistent ordering), since you want to ignore their order.
0

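My solution is below. It walks both trees in parallel and compares tags, text, tails, and attributes; note that children are compared in document order. Some of the code is adapted from: Testing Equivalence of xml.etree.ElementTree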

import xml.etree.ElementTree as ET


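def elements_equal(e1, e2):
    # Tags must match exactly.
    if e1.tag != e2.tag:
        return False
    # Differing text is a mismatch only when both sides have text;
    # a missing (None) text is treated as matching anything.
    if e1.text != e2.text:
        if e1.text is not None and e2.text is not None:
            return False
    # Same rule for the tail (the text that follows the closing tag).
    if e1.tail != e2.tail:
        if e1.tail is not None and e2.tail is not None:
            return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    # Children are compared pairwise, in document order.
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))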



def is_two_xml_equal(f1, f2):
    tree1 = ET.parse(f1)
    root1 = tree1.getroot()
    tree2 = ET.parse(f2)
    root2 = tree2.getroot()
    return elements_equal(root1, root2)

f1 = '2.xml'
f2 = '1.xml'
print(is_two_xml_equal(f1, f2))
