I am using an XSL file to merge multiple XML files. There are around 100 files, and each file has about 4000 nodes. The example XML and XSL are available in this SO question.

My xmlmerge.py is as follows:

import argparse

import lxml.etree as ET

# Command-line arguments: the XML file listing the inputs, and the stylesheet
ap = argparse.ArgumentParser()
ap.add_argument("-x", "--xmlreffile", required=True, help="Path to list of xmls")
ap.add_argument("-s", "--xslfile", required=True, help="Path to the xslfile")
args = vars(ap.parse_args())

# Parse the reference document and the stylesheet, then apply the transform
dom = ET.parse(args["xmlreffile"])
xslt = ET.parse(args["xslfile"])
transform = ET.XSLT(xslt)
newdom = transform(dom)

# Serialize the merged tree and print it to stdout
print(ET.tostring(newdom, pretty_print=True))

I want to write the output of the Python script to an XML file. I run the script as follows:

python xmlmerge.py --xmlreffile ~/Documents/listofxmls.xml --xslfile ~/Documents/xslfile.xsl

With 100 files, printing the output to the console takes around 120 minutes. However, if I try to save the same output to an XML file:

python xmlmerge.py --xmlreffile ~/Documents/listofxmls.xml --xslfile ~/Documents/xslfile.xsl >> ~/Documents/mergedxml.xml

This has been running for around 3 days and the merge is still not finished. I was not sure whether the machine had hung, so I tried with just 8 files on a different machine; after more than 4 hours, that merge was still not complete either. I don't know why it takes so much longer when I write to a file than when I print to the console. Can someone guide me?

I am using Ubuntu 14.04 and Python 2.7.
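
One way to narrow down where the time goes is to time each stage separately. The sketch below is a minimal, hypothetical helper (the name timed is not part of xmlmerge.py). It reports to stderr so the timing lines stay out of a redirected stdout, and it uses only the standard library, with a stand-in stage where xmlmerge.py would call ET.parse, transform, or ET.tostring.

```python
import sys
import time


def timed(label, func, *args, **kwargs):
    """Run func, report its wall-clock duration on stderr, return its result."""
    start = time.time()
    result = func(*args, **kwargs)
    sys.stderr.write("%s: %.3f s\n" % (label, time.time() - start))
    return result


# Stand-in stage (hypothetical); in xmlmerge.py the stages would be
# ET.parse, the transform call, and ET.tostring.
payload = timed("serialize", lambda: b"<root/>" * 1000)
```

In the script itself this might look like newdom = timed("transform", transform, dom), which would show whether the transform or the output step dominates.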

1 Answer

Why don't you make a multi-processing version of your script? There are several ways you could do it but, from what I understand, here is what I would do:

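import lxml.etree as ET
from multiprocessing.dummy import Pool as ThreadPool  # thread pool, despite the module name

# Assuming this file lists one XML path per line (adapt if necessary)
with open("listofxmls.xml", "r") as f:
    xml_files = [line.strip() for line in f if line.strip()]

def yourFunction(xml):
    # Replace this body with the steps of your parse function; the lines
    # below are only an assumed example (stylesheet name taken from the question)
    transform = ET.XSLT(ET.parse("xslfile.xsl"))
    newdom = transform(ET.parse(xml))
    return ET.tostring(newdom, pretty_print=True)

pool = ThreadPool(4)  # number of threads (adapt depending on the task and your CPU)
mergedXML = pool.map(yourFunction, xml_files)  # execute the function in parallel
pool.close()
pool.join()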

Then save mergedXML however you like.
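
A minimal sketch of the saving step, assuming mergedXML is the list of byte strings returned by pool.map (one serialized result per input, in input order). The helper name save_results, the sample data, and the output path are hypothetical. Opening the file in "wb" mode replaces its contents on each run, unlike the shell's >> redirection, which appends to an existing file.

```python
import os
import tempfile


def save_results(chunks, out_path):
    """Write each serialized chunk (bytes) to out_path, replacing any old file."""
    with open(out_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            # Keep chunks on separate lines if a chunk lacks a trailing newline.
            if not chunk.endswith(b"\n"):
                out.write(b"\n")


# Stand-in for the list that pool.map() returns (hypothetical sample data).
mergedXML = [b"<a>1</a>\n", b"<b>2</b>"]
out_path = os.path.join(tempfile.mkdtemp(), "mergedxml.xml")
save_results(mergedXML, out_path)
```

Note that concatenating several serialized documents gives a file with more than one root element, so they would need a common parent element if a single well-formed XML document is required.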

Hope it helps or, at least, points you in the right direction.
