
I'm trying to find an efficient way to compare two XML files and handle the differences in a Python script. I have two XML files similar to the following one:

<?xml version="1.0" encoding="UTF-8"?> 
<garage> 
    <car> 
        <color>red</color> 
        <size>big</size> 
        <price>10000</price>
    </car> 
    <car> 
        <color>blue</color> 
        <size>big</size> 
        <price>10000</price>
    </car>

    <!-- [...] -->

    <car>
        <color>red</color>
        <size>big</size>
        <price>11000</price>
    </car>
</garage>

These XML files contain thousands of small objects, and each file is about 5 MB in size. The tricky part is that only very few entries differ between the two files, and I only need to handle the entries that differ. In other words: I need to efficiently (!) find out which entries have changed or have been added. Unfortunately, the XML files also contain some optional entries that I don't care about at all.

I considered the following solutions:

  1. Parse both files into a DOM tree and compare them in a loop
  2. Parse both files into sets and use operators like set.difference
  3. Try to hand some of the processing over to Linux tools like grep and diff

Does anybody here have experience with the performance of these approaches and can point me in the right direction?
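To make option 2 concrete, here is a minimal sketch of what I have in mind, using only the standard library. The field list `FIELDS` and the function names are placeholders I made up for illustration, and I'm assuming every entry is a `<car>` element whose relevant values are direct children.

```python
import xml.etree.ElementTree as ET

# Assumption: only these child tags matter; optional entries are ignored.
FIELDS = ("color", "size", "price")


def car_records(xml_text):
    """Return the set of (color, size, price) tuples found in the XML text."""
    root = ET.fromstring(xml_text)
    return {
        tuple((car.findtext(tag) or "").strip() for tag in FIELDS)
        for car in root.iter("car")
    }


def changed_or_added(old_xml, new_xml):
    """Records that appear in the new document but not in the old one."""
    return car_records(new_xml) - car_records(old_xml)
```

One caveat with this sketch: a set collapses identical cars into one record, so it cannot tell that a second identical car was added. For real files, `ET.parse(path).getroot()` could replace `ET.fromstring`.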

  • By using Google and trying pypi.python.org/pypi/xmldiff Commented Oct 13, 2013 at 8:27
  • @user2799617: that project is very outdated. But if you got that working, perhaps you want to post an answer showing us how? Commented Oct 13, 2013 at 8:28
  • I would have tried option 3 first. Commented Oct 13, 2013 at 8:40
  • I would say, first parse the XML and "normalize" it into a sequence of text lines each describing one car, in a format such that the same car will consistently result in the same string. So remove the stuff you don't care about and present the car-related elements in a fixed order for each car. Then use difflib to get a diff of that. Commented Oct 13, 2013 at 8:49
  • It depends partly on what kind of output you want. Steve's suggestion is certainly the easiest one. Commented Oct 13, 2013 at 9:04
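
The normalize-then-difflib suggestion from the comments could look roughly like the sketch below. It is only an illustration: `FIELDS`, the `|` separator, and the function names are assumptions, not something the commenters specified. It uses `difflib.SequenceMatcher` directly rather than a textual diff so that only the changed lines come back.

```python
import difflib
import xml.etree.ElementTree as ET

# Assumption: these are the only values worth comparing.
FIELDS = ("color", "size", "price")


def normalized_lines(xml_text):
    """One string per car, fields in a fixed order, everything else dropped."""
    root = ET.fromstring(xml_text)
    return [
        "|".join((car.findtext(tag) or "").strip() for tag in FIELDS)
        for car in root.iter("car")
    ]


def diff_cars(old_xml, new_xml):
    """Return (removed, added) lists of normalized car lines."""
    old = normalized_lines(old_xml)
    new = normalized_lines(new_xml)
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return removed, added
```

A changed car shows up once in `removed` (old values) and once in `added` (new values). `autojunk=False` is set because the default heuristic may treat frequently repeated lines as junk in long sequences, which seems undesirable when many cars are identical.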

1 Answer


Create a cached intermediate format that only has the stuff you care about comparing. When comparing two files, A.xml & B.xml, compare their A.cached and B.cached instead, generating them if missing and removing on file change (or re-generating based on timestamp etc). The generation cost will be amortized over multiple comparisons, and you will not be iterating over unnecessary entries.

The format of ".cached" really depends on what you care about and how much information/context you need. It could even be a binary representation.
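
As a rough sketch of the caching idea, assuming a tab-separated text format and a timestamp check for invalidation (the `FIELDS` tuple and the `cached_lines` helper are hypothetical names chosen for this example):

```python
import os
import xml.etree.ElementTree as ET

# Assumption: only these values are worth keeping in the cache.
FIELDS = ("color", "size", "price")


def cached_lines(xml_path):
    """Return one tab-separated line per car, via A.cached next to A.xml."""
    cache_path = os.path.splitext(xml_path)[0] + ".cached"
    fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)
    )
    if not fresh:
        # (Re)generate the cache, streaming so the whole tree is never held.
        with open(cache_path, "w", encoding="utf-8") as out:
            for _event, elem in ET.iterparse(xml_path):  # 'end' events only
                if elem.tag == "car":
                    values = ((elem.findtext(t) or "").strip() for t in FIELDS)
                    out.write("\t".join(values) + "\n")
                    elem.clear()  # drop the car's children once written
    with open(cache_path, encoding="utf-8") as cache:
        return cache.read().splitlines()
```

Comparing two files then reduces to comparing `cached_lines("A.xml")` with `cached_lines("B.xml")`, for example with set operations or difflib. The modification-time check is the simplest invalidation rule; a content hash would be more robust if timestamps are unreliable.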


1 Comment

Thanks for all your suggestions. After reading a lot, I came to the conclusion that I'll try to construct a stylesheet that transforms the XML into a flat text file with all the important values. I hope that the existing XML tooling is far more efficient than any implementation I could come up with. I'll let you know whether it worked as soon as it's finished...
