I'm trying to find an efficient approach to compare two XML files and handle the differences in a Python script. The scenario is that I have two XML files similar to the following one:
<?xml version="1.0" encoding="UTF-8"?>
<garage>
<car>
<color>red</color>
<size>big</size>
<price>10000</price>
</car>
<car>
<color>blue</color>
<size>big</size>
<price>10000</price>
<!-- [...] -->
<car>
<color>red</color>
<size>big</size>
<price>11000</price>
</car>
</car>
</garage>
Those XML files contain thousands of small objects, and each file is about 5 MB in size. The tricky thing is that only very few entries differ between the two files, and I only need to handle the entries that differ. In other words: I need to efficiently (!) find out which entries have changed or have been added. Unfortunately, the XML files also contain some optional entries that I don't care about at all.
I considered the following solutions:
- Parse both files into a DOM tree and compare them in a loop
- Parse both files into sets and use operators like set.difference
- Try to hand some of the processing over to Linux tools like grep and diff
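For illustration, the set-based option might look roughly like the sketch below. It assumes that only the color, size, and price children matter and that every other child is one of the optional entries to ignore. The names RELEVANT_TAGS, car_records, and changed_or_added are made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: only these child tags matter; any other child is treated
# as an optional entry and ignored.
RELEVANT_TAGS = ("color", "size", "price")


def car_records(source):
    """Return the set of (color, size, price) tuples for every <car> in source.

    `source` may be a file name or a binary file object.
    """
    records = set()
    # iterparse reports each element once its closing tag has been read,
    # so nested <car> elements are picked up as well.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "car":
            # findtext only looks at direct children of this <car>.
            records.add(
                tuple((elem.findtext(tag) or "").strip() for tag in RELEVANT_TAGS)
            )
    return records


def changed_or_added(old_source, new_source):
    """Records present in the new file but not in the old one."""
    return car_records(new_source) - car_records(old_source)


# Small in-memory example (hypothetical data, not my real files).
OLD_XML = (
    b"<garage>"
    b"<car><color>red</color><size>big</size><price>10000</price></car>"
    b"<car><color>blue</color><size>big</size><price>10000</price>"
    b"<note>optional</note></car>"
    b"</garage>"
)
NEW_XML = (
    b"<garage>"
    b"<car><color>red</color><size>big</size><price>11000</price></car>"
    b"<car><color>blue</color><size>big</size><price>10000</price></car>"
    b"<car><color>green</color><size>small</size><price>5000</price></car>"
    b"</garage>"
)

diff = changed_or_added(io.BytesIO(OLD_XML), io.BytesIO(NEW_XML))
print(sorted(diff))
# [('green', 'small', '5000'), ('red', 'big', '11000')]
```

As far as I can tell, a plain set difference like this cannot tell a changed entry from an added one, and identical cars collapse into a single record, so it would only be a starting point.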
Does anybody here have experience with the performance of such approaches and can point me in the right direction?
difflib to get a diff of that.