16

I have several XML files. They all have the same structure, but were split because of file size. So, let's say I have A.xml, B.xml, C.xml and D.xml and want to combine/merge them into combined.xml, using a command line tool.

A.xml

<products>
    <product id="1234"></product>
    ...
</products>

B.xml

<products>
  <product id="5678"></product>
  ...
</products>

etc.

2
  • 1
    try adding the tag xmlstarlet to increase the number of people looking at your question. Also tag for OS, windows, unix/linux, or ?? Good luck. Commented Jan 25, 2012 at 20:02
  • See also stackoverflow.com/questions/80609/merge-xml-documents Commented Mar 21, 2013 at 11:29

5 Answers

23

High-tech answer:

Save this Python script as xmlcombine.py:

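#!/usr/bin/env python
import sys
from xml.etree import ElementTree

def run(files):
    # The root of the first file becomes the root of the combined document.
    first = None
    for filename in files:
        data = ElementTree.parse(filename).getroot()
        if first is None:
            first = data
        else:
            # Append the children of every later root to the first root.
            first.extend(data)
    if first is not None:
        # encoding="unicode" returns str on Python 3; the default returns bytes.
        print(ElementTree.tostring(first, encoding="unicode"))

if __name__ == "__main__":
    run(sys.argv[1:])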

To combine files, run:

python xmlcombine.py ?.xml > combined.xml

For further enhancement, consider using:

  • chmod +x xmlcombine.py: Makes the script executable, so you can omit python on the command line (run it as ./xmlcombine.py, or put it on your PATH)

  • xmlcombine.py !(combined).xml > combined.xml: Collects all XML files except the output, but requires bash's extglob option (shopt -s extglob)

  • xmlcombine.py *.xml | sponge combined.xml: Also includes the existing contents of combined.xml as input, but requires the sponge program (from moreutils)

  • import lxml.etree as ElementTree: Uses a potentially faster XML parser
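The concern behind the options above is keeping the output file out of the inputs. A minimal sketch of a variant that handles this inside the script, assuming the output path is passed in explicitly; the function name combine and its parameters are illustrative, not part of the original script:

```python
import os
from xml.etree import ElementTree


def combine(paths, output):
    """Merge the root children of every input file into the first root."""
    out_abs = os.path.abspath(output)
    # Skip the output file if a glob such as *.xml happened to include it.
    inputs = [p for p in sorted(paths) if os.path.abspath(p) != out_abs]
    if not inputs:
        raise ValueError("no input files")
    root = ElementTree.parse(inputs[0]).getroot()
    for path in inputs[1:]:
        root.extend(ElementTree.parse(path).getroot())
    # All inputs are parsed before the output is opened for writing.
    ElementTree.ElementTree(root).write(
        output, encoding="utf-8", xml_declaration=True
    )
    return len(inputs)
```

Called as combine(glob.glob("*.xml"), "combined.xml"), this should ignore any stale combined.xml and needs neither extglob nor sponge.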


8 Comments

In MacOS Python 2.6 ... I'm getting _ElementInterface instance has no attribute 'extend'. Which version of python are you using?
This was probably Python 2.7 on Linux. You could work around that by iterating over the children and adding each one manually; or perhaps lxml will work for you.
@Vik: The question mark is a shell globbing character, standing for any single character. If your xml files aren't named like the ones in the question, then ?.xml fails to match anything, so the shell passes it as-is to the program, which complains because there's no file with that name. Try passing it the real names of your files; just make sure that the output file doesn't get passed in as an input.
@InêsMartins: That means one of your source xml files isn't correctly formatted. First, ensure that your output file isn't getting listed as one of the input files by mistake; if that's not the issue, then check the format of each input file. Try running xmlcombine.py on each file individually, and see which one produces the error.
This just concatenates both input files within a single top level element. It does not merge on a lower level. So, I get something like <foo><bars><bar1/><bar2/></bars><bars><bar3/><bar4/></bars></foo>. It does not merge the lower levels as in <foo><bars><bar1/><bar2/><bar3/><bar4/></bars></foo>
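For the lower-level merge described in the last comment, one possible sketch follows. It assumes you can name the container tags that should be merged rather than duplicated; merge_into and merge_tags are hypothetical names, and any tag not listed is simply appended:

```python
from xml.etree import ElementTree


def merge_into(target, source, merge_tags):
    """Move the children of `source` into `target`.

    A child whose tag is listed in `merge_tags` (a hypothetical parameter
    naming the container elements) is merged into the first existing child
    of `target` with that tag; every other child is simply appended.
    """
    # Iterate over a copy so the loop is safe if appending mutates `source`.
    for child in list(source):
        existing = target.find(child.tag) if child.tag in merge_tags else None
        if existing is None:
            target.append(child)
        else:
            merge_into(existing, child, merge_tags)
    return target
```

With the two <foo> documents from the comment and merge_tags={"bars"}, this should leave a single <bars> element holding all four children.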
11

xml_grep

http://search.cpan.org/dist/XML-Twig/tools/xml_grep/xml_grep

xml_grep --pretty_print indented --wrap products --descr '' --cond "product" *.xml > combined.xml

  • --wrap : encloses/wraps the XML result in the given tag. (here: products)
  • --cond : the xml subtree to grep (here: product)
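If Perl and XML::Twig are not available, the same wrap-and-select idea can be approximated in Python. This is only a rough sketch, not a reimplementation of xml_grep: the function name grep_and_wrap is made up here, cond is treated as a plain tag name, and nested matches would be collected more than once.

```python
from xml.etree import ElementTree


def grep_and_wrap(paths, cond="product", wrap="products"):
    """Collect every element with tag `cond` and wrap them in a new root."""
    root = ElementTree.Element(wrap)
    for path in paths:
        tree_root = ElementTree.parse(path).getroot()
        # iter() walks the whole tree, so matches at any depth are found.
        root.extend(list(tree_root.iter(cond)))
    return root
```

The returned element can be serialized with ElementTree.tostring(root, encoding="unicode").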


1

Low-tech simple answer:

echo '<products>' > combined.xml
grep -vh '</\?products>\|<?xml' *.xml >> combined.xml
echo '</products>' >> combined.xml

Limitations:

  • The opening and closing tags need to be on their own line.
  • The files need to all have the same outer tags.
  • The outer tags must not have attributes.
  • The files must not have inner tags that match the outer tags.
  • Any current contents of combined.xml will be wiped out instead of getting included.

Each of these limitations can be worked around, but not all of them easily.


0

Merging two trees requires deciding which elements are identical and which should be replaced. Unfortunately, this is not obvious: it depends on more semantics than can be inferred from the source XML documents.

Consider the case where the first document has a middle level containing several elements with the same tag but different attributes. The second document adds an attribute to one of those existing elements, and also adds another child to it. To merge them correctly, one has to know the semantics.

<params>
...
<param><name>hello</name><value>world</value></param>
...
</params>

add/merge:

<params>
   <param><name>hello</name><value>yellow submarine</value></param>
</params>
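To make the point concrete, here is a sketch of one possible merge policy for the params example. The policy itself is an assumption that the XML does not state: <name> is treated as the key, a later <value> overrides an earlier one, and unknown names are appended. The function name merge_params is illustrative.

```python
from xml.etree import ElementTree


def merge_params(base, update):
    """Merge <param> children of `update` into `base`, keyed by <name> text.

    Assumed policy (not implied by the XML itself): a param with a known
    name replaces the value of the existing one; unknown names are appended.
    """
    by_name = {p.findtext("name"): p for p in base.findall("param")}
    for param in update.findall("param"):
        existing = by_name.get(param.findtext("name"))
        if existing is None:
            base.append(param)
            continue
        value = existing.find("value")
        if value is None:
            # The existing param had no <value>; create one to hold the update.
            value = ElementTree.SubElement(existing, "value")
        value.text = param.findtext("value")
    return base
```

A different policy (keep the first value, keep both, or raise on conflict) would be equally valid XML, which is why the choice has to come from knowledge of the data.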


0

Another very helpful tool is yq, which aims to be jq for YAML, TOML and XML.

It can be installed via pip; the XML-handling command is then called xq.

pip install yq
xq .products ?.xml --xml-output --xml-root=products > combined.xml

