Efficiently compare two XML files in python

Question

I'm trying to find an efficient approach to compare two XML files and handle the differences in a python script. The scenario is that I have two XML files similar to the following on:

<?xml version="1.0" encoding="UTF-8"?> 
<garage> 
    <car> 
        <color>red</color> 
        <size>big</size> 
        <price>10000</price>
    </car> 
    <car> 
        <color>blue</color> 
        <size>big</size> 
        <price>10000</price>

    <!-- [...] -->

    <car> 
        <color>red</color> 
        <size>big</size> 
        <price>11000</price>
    </car> 
    </car> 
</garage>

Those XML files contain thousands of small objects. The files themselves have a size of about 5 MB. The tricky thing is that only a very few entries of the two files differ and that I only need to handle the information that differs. With other words: I need to efficiently (!) find out, which of the entries changed or have been added. Unfortunately the XML files also contain some optional entries that I don't care about at all.

I considered the following solutions:

Parse both files into a DOM tree and compare them in a loop
Parse both files into sets and use operators like set.difference
Try to hand some of the processing over to some linux tools like grep and diff

Does anybody here have experiences with the performance of such approaches and can guide me a direction to walk into?

@user2799617: that project is very outdated. But if you got that working, perhaps you want to post an answer showing us how? — Martijn Pieters
– Martijn Pieters, Commented Oct 13, 2013 at 8:28
I would say, first parse the XML and "normalize" it into a sequence of text lines each describing one car, in a format such that the same car will consistently result in the same string. So remove the stuff you don't care about and present the car-related elements in a fixed order for each car. Then use difflib to get a diff of that. — Steve Jessop
– Steve Jessop, Commented Oct 13, 2013 at 8:49
It depends partly on what kind of output you want. Steve's suggestion is certainly the easiest one. — Lennart Regebro
– Lennart Regebro, Commented Oct 13, 2013 at 9:04

Preet Kukreti · Accepted Answer · 2013-10-13 10:39:15Z

1

Create a cached intermediate format that only has the stuff you care about comparing. When comparing two files, A.xml & B.xml, compare their A.cached and B.cached instead, generating them if missing and removing on file change (or re-generating based on timestamp etc). The generation cost will be amortized over multiple comparisons, and you will not be iterating over unnecessary entries.

The format of ".cached" really depends on what you care about and how much information/context you need. It could perhaps even potentially have a binary representation

answered Oct 13, 2013 at 10:39

Preet Kukreti

8,66731 silver badges36 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Norbert Over a year ago

Thanks for all your suggestions. After reading a lot I came to the conclusion that I'll try to construct a stylesheet that transforms the XML into a flat text file with all the important values. I'd hope that the implementation of the tools around XML is way more efficient than any implementation I can think of. I'll let you knwo whether it worked as soon as it's finished...

Collectives™ on Stack Overflow

Efficiently compare two XML files in python

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related