
I'm trying to merge two XML files. The files have the same overall structure, but the details inside each word element are different.

file1.xml:

<book>
    <chapter id="113">
        <sentence id="1">
            <word id="128160">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
             </sentence>
             <sentence id="2">
            <word id="128162">
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
             </sentence>
        </chapter>
</book>

file2.xml:

<book>
    <chapter id="113">
        <sentence id="1">
            <word id="128160">
            <concept English="joke"/>
            </word>
            <word id="128161">
                <concept English="romance"/>
            </word>
             </sentence>
             <sentence id="2">
            <word id="128162">
                <concept English="happiness"/>
            </word>
             </sentence>
        </chapter>
</book>

The desired output is:

<book>
    <chapter id="113">
        <sentence id="1">
            <word id="128160">
                    <concept English="joke"/>
                    <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <concept English="romance"/>
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
             </sentence>
             <sentence id="2">
            <word id="128162">
                <concept English="happiness"/>
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
             </sentence>
        </chapter>
</book>

I tried doing that in Python, but I didn't get the desired output:

import os, os.path, sys
import glob
from xml.etree import ElementTree

output = open('merge.xml','w')
files="sample"
xml_files = glob.glob(files +"/*.xml")
xml_element_tree = None
for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        # print ElementTree.tostring(data)
        for word in data.iter('word'):
            if xml_element_tree is None:
                xml_element_tree = data 
                insertion_point = xml_element_tree.findall("book/chapter/sentence/word/*")
            else:
                insertion_point.extend(word) 
if xml_element_tree is not None:
        print>>output, ElementTree.tostring(xml_element_tree)

Any help would be appreciated.

3 Answers

1

One way I've done something similar in the past is to create a new XML document and then append the values you're looking for. I don't believe there is a built-in way to "merge" them.

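import xml.etree.ElementTree as ET

# tempFile and xpathQuery are placeholders for your input file and query
xml = ET.fromstring("<book></book>")
document = ET.parse(tempFile)
childNodeList = document.findall(xpathQuery)
for node in childNodeList:
    xml.append(node)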
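
For the word-id matching in the question, here is a minimal sketch using only the standard library's ElementTree. The helper name merge_words_by_id and the shortened sample strings are made up for illustration, and it assumes that words which exist only in the second file can be skipped. It indexes the words of the first file by id, then inserts the children from the second file at the front of the matching word, which gives the concept-first order shown in the desired output.

```python
import xml.etree.ElementTree as ET

# Shortened stand-ins for file1.xml and file2.xml from the question.
FILE1 = """<book><chapter id="113"><sentence id="1">
<word id="128160"><POS Tag="V"/><grammar type="STEM"/></word>
</sentence></chapter></book>"""

FILE2 = """<book><chapter id="113"><sentence id="1">
<word id="128160"><concept English="joke"/></word>
</sentence></chapter></book>"""


def merge_words_by_id(base_root, extra_root):
    """Copy the children of each <word> in extra_root into the <word>
    with the same id in base_root, placing them first."""
    # Index every <word> in the base document by its id attribute.
    words_by_id = {w.get("id"): w for w in base_root.iter("word")}
    for extra_word in extra_root.iter("word"):
        target = words_by_id.get(extra_word.get("id"))
        if target is None:
            continue  # assumption: words missing from the base file are skipped
        # Insert in order at the front so <concept> comes before <POS>.
        for position, child in enumerate(list(extra_word)):
            target.insert(position, child)
    return base_root


merged = merge_words_by_id(ET.fromstring(FILE1), ET.fromstring(FILE2))
merged_tags = [child.tag for child in merged.find(".//word")]
print(merged_tags)
```

With real files you would pass ET.parse(path).getroot() for each document and write the result with ET.ElementTree(merged).write(...).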

2 Comments

OK, but how do I get the right XPath query for my files? How do I check whether the two files contain the same word id, and then copy the data into a new XML file?
Well, those are separate questions. You asked how to merge two XML files. For your XPath query, I'd look here: docs.python.org/2/library/… As for the word-id comparison, you'll have to execute the XPath query to get a list of matching nodes, iterate over it, compare the word ids, and add any node that is not already in your new XML. That part is really an algorithm question...
1

Here's a solution. Start with an empty merged document and then as you enumerate the files, add elements you can't find into the merged document. You could generalize this but here's a first cut:

import lxml.etree
merged = lxml.etree.Element('book')
for xml_file in xml_files:
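    # parse() returns an ElementTree; iterate over the children of its root.
    # list() snapshots the children, because append() below moves elements
    # out of the source tree while we loop over it.
    for merge_chapter in list(lxml.etree.parse(xml_file).getroot()):
        try:
            chapter = merged.xpath('chapter[@id=%s]' % merge_chapter.get('id'))[0]
            for merge_sentence in list(merge_chapter):
                try:
                    sentence = chapter.xpath('sentence[@id=%s]' % merge_sentence.get('id'))[0]
                    for merge_word in list(merge_sentence):
                        try:
                            word = sentence.xpath('word[@id=%s]' % merge_word.get('id'))[0]
                            for data in list(merge_word):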
                                try:
                                    word.xpath(data.tag)[0]
                                except IndexError:
                                    # add newly discovered word data
                                    word.append(data)
                        except IndexError:
                            # add newly discovered word
                            sentence.append(merge_word)
                except IndexError:
                    # add newly discovered sentence
                    chapter.append(merge_sentence)
        except IndexError:
            # add newly discovered chapter
            merged.append(merge_chapter)
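
If you would rather not drive the logic with exceptions, the same idea can be written as a small recursive function. This is only a sketch, using the standard library's ElementTree instead of lxml; the name merge_into and the tiny sample documents are invented for the example. It matches children on the pair (tag, id), so elements without an id, such as POS or concept, are matched on tag alone, as in the code above.

```python
import xml.etree.ElementTree as ET


def merge_into(target, source):
    """Recursively merge source's children into target, matching on (tag, id)."""
    # Index the target's existing children by (tag, id).
    existing = {(c.tag, c.get("id")): c for c in target}
    # list() snapshots the children so appending cannot disturb the loop.
    for child in list(source):
        key = (child.tag, child.get("id"))
        match = existing.get(key)
        if match is None:
            target.append(child)      # newly discovered element
            existing[key] = child
        else:
            merge_into(match, child)  # same element in both: descend


# Tiny illustrative documents, not the full files from the question.
doc_a = ET.fromstring(
    '<book><chapter id="113"><sentence id="1">'
    '<word id="128160"><POS Tag="V"/></word>'
    '</sentence></chapter></book>')
doc_b = ET.fromstring(
    '<book><chapter id="113"><sentence id="1">'
    '<word id="128160"><concept English="joke"/></word>'
    '</sentence><sentence id="2"/></chapter></book>')

merge_into(doc_a, doc_b)
word_tags = [c.tag for c in doc_a.find(".//word")]
sentence_ids = [s.get("id") for s in doc_a.iter("sentence")]
print(word_tags, sentence_ids)
```

Note that new elements are appended, so merged data ends up after the existing children rather than before them.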

4 Comments

Hi, thanks for your help. I tried to run the code but got this error: AttributeError: 'ElementTree' object has no attribute 'element'
I'm working with the xml ElementTree module. Which module is the code written for?
Oops... my bad. I mixed up lxml and ElementTree there. lxml has a great XPath parser and I favor it over ElementTree. I've made an edit.
Is using exceptions for control flow a good thing?
0

Given that you want to merge File2 into File1, you can loop over all of the elements in File2 and copy the attributes from each File2 element into the matching File1 element.

I have to do something similar on a project that I'm working on now. Here is my current solution, which should work under Python 2.7.

Note that I extended the requirements to also copy attributes between common nodes. You will see that I added the following attributes to A:

  • drums='Neil'
  • bass='Geddy'

Then to B I added:

  • guitar='Alex'

The final merged document has all three members of the power trio.

I also added <sentence id='3'> to B to demonstrate that the order of elements no longer matters.

#!/usr/bin/python
from lxml import etree
from copy import deepcopy

xmlA='''
<book>
    <chapter id="113">

        <sentence id="1" drums='Neil'>
            <word id="128160" bass='Geddy'>
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
        </sentence>

        <sentence id="2">
            <word id="128162">
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
        </sentence>

    </chapter>
</book>
'''

xmlB='''
<book>
    <chapter id="113">

        <sentence id="3">
            <word id="128168">
                <concept English="sadness"/>
            </word>
        </sentence>

        <sentence id="1">
            <word id="128160">
                <concept English="joke"/>
            </word>
            <word id="128161">
                <concept English="romance"/>
            </word>
        </sentence>

        <sentence id="2" guitar='Alex'>
            <word id="128162">
                <concept English="happiness"/>
            </word>
        </sentence>


    </chapter>
</book>
'''

import re

##
#   @brief  Translates a positional XPath into an explicit, id-based XPath.
#   In the XML examples above, getpath will return the following for
#   <sentence id='1'>:
#       - xmlA = /book/chapter/sentence[1]
#       - xmlB = /book/chapter/sentence[2]
#
#   A path that is explicit in both documents would be:
#       - xmlA = /book/chapter/sentence[@id='1']
#       - xmlB = /book/chapter/sentence[@id='1']
#
def convertXpath(element):
    newPath = ''
    tree    = element.getroottree()
    path    = tree.getpath(element).split('/')
    root    = tree.getroot()

    for p in path:
        if p == '':
            continue

        if re.search(r'\[[0-9]*\]', p):

            # Get the element at this path
            #
            node = root.xpath(newPath+'/'+p)[0]
            id=node.get('id')

            # Replace the positional index with an id predicate
            p=re.sub(r'\[[0-9]*\]','', p)
            newPath += '/'+p+"[@id='"+id+"']"

        else:
            newPath+='/'+p

    return newPath



def mergeXml(a,b):

    for node in a.nodes():
        path = convertXpath(node)

        # find the element in the other document
        #
        elements =  b.root.xpath(path)

        for e in elements:
            for name, value in node.items():
                if name == 'id':
                    continue
                e.set(name,value)

        if len(elements) == 0:
            # Add the node to other document
            #
            newElement = deepcopy(node)

            # Find the path to the parent
            #
            parent = node.getparent()
            path = convertXpath(parent)

            bParent = b.root.xpath(path)[0]
            bParent.append(newElement)

class XmlDoc:
    def __init__(self, xml):
        self.root = etree.fromstring(xml)
        self.tree = self.root.getroottree()

    def __str__(self):
        return etree.tostring(self.root, pretty_print=True, encoding='unicode')

    def nodes(self):
        return self.root.iter('*')



if __name__ == '__main__':
    a = XmlDoc(xmlA)
    b = XmlDoc(xmlB)

    mergeXml(a,b)
    print(b)

This yields the following output:

<book>
    <chapter id="113">

        <sentence id="3">
            <word id="128168">
                <concept English="sadness"/>
            </word>
        </sentence>

        <sentence id="1" drums="Neil">
            <word id="128160" bass="Geddy">
                <concept English="joke"/>
            <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <concept English="romance"/>
            <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
        </sentence>

        <sentence id="2" guitar="Alex">
            <word id="128162">
                <concept English="happiness"/>
            <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
        </sentence>


    </chapter>
</book>
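
The id-based path idea behind convertXpath can also be sketched without lxml. The following is an illustration only, using the standard library's ElementTree; the function name id_path and the parent_of dictionary are hypothetical, and the parent map stands in for lxml's getparent(), which ElementTree elements do not have.

```python
import xml.etree.ElementTree as ET


def id_path(element, parent_of):
    """Build a path that identifies each step by its id attribute
    instead of its position among siblings."""
    steps = []
    node = element
    while node is not None:
        node_id = node.get("id")
        if node_id is None:
            steps.append(node.tag)
        else:
            steps.append("%s[@id='%s']" % (node.tag, node_id))
        node = parent_of.get(node)  # None once we pass the root
    return "/" + "/".join(reversed(steps))


# Small sample where sentence 1 is the second sentence in document order.
root = ET.fromstring(
    '<book><chapter id="113">'
    '<sentence id="3"/><sentence id="1"><word id="128160"/></sentence>'
    '</chapter></book>')
# ElementTree elements have no getparent(), so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

word = root.find(".//word")
path = id_path(word, parent_of)
print(path)
```

Because the path names ids rather than positions, the same string identifies the same word in either document, whatever order the sentences appear in.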
