
I have a large XML file that contains image annotation details. A sample follows:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="253" left="166" width="56" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="229" width="61" height="21">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="290" width="58" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="361" width="56" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="417" width="63" height="22">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="486" width="63" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="504" left="329" width="51" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

I want to split this file by tag name. It has two tags, ScoreBoard-Vivon and Perimeter-Vivon, and I want to create a separate XML file for each. The desired output is as follows:

For ScoreBoard-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="504" left="329" width="51" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

For Perimeter-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="253" left="166" width="56" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="229" width="61" height="21">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="290" width="58" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="361" width="56" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="417" width="63" height="22">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="486" width="63" height="20">
                <label>Perimeter-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

I have 350-400 such tags. How can I split them into individual files?

New Example:

<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-SVT" color="#f9e99c"/>
        <tag name="Perimeter-Vivon" color="#032585"/>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
        <tag name="Perimeter-StarSports" color="#12dadd"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0011.jpg">
            <box top="505" left="327" width="56" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
            <box top="218" left="387" width="67" height="24">
                <label>Perimeter-SVT</label>
            </box>
        </image>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0005.jpg">
            <box top="254" left="159" width="64" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="225" width="61" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="285" width="63" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="253" left="357" width="58" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="424" width="56" height="25">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="256" left="484" width="65" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="507" left="326" width="58" height="26">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0009.jpg">
            <box top="249" left="400" width="59" height="29">
                <label>Perimeter-StarSports</label>
            </box>
        </image>
    </images>
</dataset>
  • You can use XSLT; it is well suited to this. You could create a template that selects the tags you want. (Commented Nov 14, 2017 at 8:42)

2 Answers

One way would be to take the original XML, determine the <tags> in use, then make copies of the XML and remove all tags that don't match:

import xml.etree.ElementTree as ET
import copy

img_xml = """<?xml version="1.0" encoding="UTF-8"?>
<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box top="253" left="166" width="56" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="229" width="61" height="21">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="290" width="58" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="361" width="56" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="417" width="63" height="22">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="486" width="63" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="504" left="329" width="51" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>
"""

root = ET.fromstring(img_xml)
tag_names = [tag.attrib['name'] for tag in root.find('tags')]

for tag_name in tag_names:
    root_copy = copy.deepcopy(root)

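    # Blank out every <tag> other than the current one.
    # Note: clear() empties an element but leaves it in the tree as <tag />.
    for tag in root_copy.find('tags'):
        if tag.attrib['name'] != tag_name:
            tag.clear()

    # Blank out every <box> whose <label> (its first child) does not match.
    # As above, the emptied element stays behind as <box />.
    for box in root_copy.findall("./images/image/box"):
        if box[0].text != tag_name:
            box.clear()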

    ET.ElementTree(root_copy).write('{}.xml'.format(tag_name))

Giving you two output XML files:

Perimeter-Vivon.xml

<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag /><tag color="#032585" name="Perimeter-Vivon" />
    </tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box height="24" left="166" top="253" width="56">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="21" left="229" top="255" width="61">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="23" left="290" top="254" width="58">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="20" left="361" top="254" width="56">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="22" left="417" top="254" width="63">
                <label>Perimeter-Vivon</label>
            </box>
            <box height="20" left="486" top="254" width="63">
                <label>Perimeter-Vivon</label>
            </box>
            <box /></image>
    </images>
</dataset>        

ScoreBoard-Vivon.xml

<dataset>
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag color="#bf5786" name="ScoreBoard-Vivon" />
        <tag /></tags>
    <images>
        <image file="/var/www/html/beacon.com/resources/videos/ST2_20170812/ST_2_20170812-0005.jpg">
            <box /><box /><box /><box /><box /><box /><box height="29" left="329" top="504" width="51">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

1 Comment

Many thanks. The code works perfectly for the given example, but I face two problems. 1. I don't know how to get rid of the unwanted <box /> tags in the second file; in fact, there are as many box tags as there are annotations. 2. My original file has thousands of images one after the other, and when I use the same approach, all of those image tags are retained whether or not they contain a box. I will also update the example.
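
Both problems raised in this comment come from clear(), which empties an element but leaves it attached to its parent. A minimal sketch of one possible fix, assuming the same ElementTree approach as the answer: detach the unwanted elements with remove() and then drop any <image> left without boxes. The helper name prune_to_tag and the shortened sample data are illustrative and not part of the original answer.

```python
import copy
import xml.etree.ElementTree as ET


def prune_to_tag(root, tag_name):
    """Return a deep copy of root that keeps only what belongs to tag_name."""
    pruned = copy.deepcopy(root)

    # Detach non-matching <tag> elements. Iterate over a list copy, because
    # removing children while iterating the live element skips siblings.
    tags = pruned.find('tags')
    for tag in list(tags):
        if tag.get('name') != tag_name:
            tags.remove(tag)

    images = pruned.find('images')
    for image in list(images):
        # findall() returns a list, so removing boxes inside the loop is safe.
        for box in image.findall('box'):
            if box.findtext('label') != tag_name:
                image.remove(box)
        # An image with no boxes left has no annotation for this tag.
        if image.find('box') is None:
            images.remove(image)

    return pruned


# Shortened, made-up sample with the same structure as the question's file.
sample = """<dataset>
  <tags>
    <tag name="A" color="#111111"/>
    <tag name="B" color="#222222"/>
  </tags>
  <images>
    <image file="one.jpg">
      <box top="1" left="1" width="1" height="1"><label>A</label></box>
      <box top="2" left="2" width="2" height="2"><label>B</label></box>
    </image>
    <image file="two.jpg">
      <box top="3" left="3" width="3" height="3"><label>B</label></box>
    </image>
  </images>
</dataset>"""

root = ET.fromstring(sample)
pruned_a = prune_to_tag(root, 'A')
print(ET.tostring(pruned_a, encoding='unicode'))
```

Each pruned tree can then be written out the same way as in the answer, for example with ET.ElementTree(pruned_a).write('A.xml', encoding='UTF-8', xml_declaration=True). Removed elements take their trailing whitespace with them, so the indentation of the result may look uneven; on Python 3.9 or later, ET.indent() can tidy it before writing.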

The following (XSLT 2.0) stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xsl:template match="//dataset/tags">
      <xsl:for-each select="./tag">
            <xsl:variable name="tagName" select="@name" />

                <xsl:result-document method="xml" href="{$tagName}.xml">
                    <dataset>    
                        <xsl:copy-of select="/dataset/name"/>
                        <xsl:copy-of select="/dataset/comment"/>
                        <tags>
                            <xsl:copy-of select="/dataset/tags/tag[./@name = $tagName]"/>
                        </tags>
                        <images>
                        <xsl:for-each select="/dataset/images/image[./box/label/text() = $tagName]">
                            <image> 
                                <xsl:copy-of select="./@file"/>
                                <xsl:copy-of select="./box[./label[./text() = $tagName]]"/>
                            </image>
                        </xsl:for-each>
                        </images>
                    </dataset>
                </xsl:result-document>                              

      </xsl:for-each>
    </xsl:template>     
</xsl:stylesheet>

When applied to your input, it produces the following results:

Perimeter-SVT.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-SVT" color="#f9e99c"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0011.jpg">
            <box top="218" left="387" width="67" height="24">
                <label>Perimeter-SVT</label>
            </box>
        </image>
    </images>
</dataset>

Perimeter-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-Vivon" color="#032585"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0005.jpg">
            <box top="254" left="159" width="64" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="255" left="225" width="61" height="20">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="285" width="63" height="23">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="253" left="357" width="58" height="24">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="254" left="424" width="56" height="25">
                <label>Perimeter-Vivon</label>
            </box>
            <box top="256" left="484" width="65" height="23">
                <label>Perimeter-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

ScoreBoard-Vivon.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="ScoreBoard-Vivon" color="#bf5786"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0011.jpg">
            <box top="505" left="327" width="56" height="29">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0005.jpg">
            <box top="507" left="326" width="58" height="26">
                <label>ScoreBoard-Vivon</label>
            </box>
        </image>
    </images>
</dataset>

Perimeter-StarSports.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataset xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <name>dataset containing bounding box labels on images</name>
    <comment>created by BBTag</comment>
    <tags>
        <tag name="Perimeter-StarSports" color="#12dadd"/>
    </tags>
    <images>
        <image file="/var/www/html/tamsports.com/resources/videos/STAR_SPORTS_2_20170812/STAR_SPORTS_2_20170812-0009.jpg">
            <box top="249" left="400" width="59" height="29">
                <label>Perimeter-StarSports</label>
            </box>
        </image>
    </images>
</dataset>

4 Comments

Thank you. Your solution also worked, but I wanted the output as separate files. Thank you again for your help.
You are welcome. The XSLT produces a separate file for each tag, named {$tagName}.xml.
Oh, how do I then get them separated? I tried the code given at gist.github.com/anupamshakya7/11285898. I am sorry, I don't have a clue about working with XML files. Could you please add some more guidance?
I have updated my answer to handle multiple image elements, as in your updated example. Bear in mind that you'll need an XSLT 2.0 processor to run this transformation. You can always ask a new question about running XSLT with the technology you are using.
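
For readers without an XSLT 2.0 processor at hand, the selection logic of the stylesheet can be approximated with Python's standard library. The following is a rough sketch under stated assumptions, not a port of the stylesheet: it groups boxes by label in a single pass over the images and then builds one new <dataset> per <tag>. The function name split_by_label and the shortened sample data are illustrative.

```python
import copy
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_by_label(root):
    """Return {tag name: new <dataset> element}, one per <tag> in root."""
    # images_by_tag[label] -> list of (image attributes, matching <box> list)
    images_by_tag = defaultdict(list)
    for image in root.iterfind('./images/image'):
        per_label = defaultdict(list)
        for box in image.iterfind('box'):
            per_label[box.findtext('label')].append(box)
        for label, boxes in per_label.items():
            images_by_tag[label].append((dict(image.attrib), boxes))

    datasets = {}
    for tag in root.iterfind('./tags/tag'):
        name = tag.get('name')
        dataset = ET.Element('dataset')
        # Copy the shared header elements when they are present.
        for header in ('name', 'comment'):
            source = root.find(header)
            if source is not None:
                dataset.append(copy.deepcopy(source))
        ET.SubElement(dataset, 'tags').append(copy.deepcopy(tag))
        images = ET.SubElement(dataset, 'images')
        # Only images that contain at least one box for this tag are emitted.
        for attrib, boxes in images_by_tag.get(name, []):
            image = ET.SubElement(images, 'image', attrib)
            image.extend([copy.deepcopy(box) for box in boxes])
        datasets[name] = dataset
    return datasets


# Shortened, made-up sample with the same structure as the question's file.
sample = """<dataset>
  <name>demo</name>
  <comment>demo comment</comment>
  <tags>
    <tag name="A" color="#111111"/>
    <tag name="B" color="#222222"/>
  </tags>
  <images>
    <image file="one.jpg">
      <box top="1" left="1" width="1" height="1"><label>A</label></box>
      <box top="2" left="2" width="2" height="2"><label>B</label></box>
    </image>
    <image file="two.jpg">
      <box top="3" left="3" width="3" height="3"><label>B</label></box>
    </image>
  </images>
</dataset>"""

datasets = split_by_label(ET.fromstring(sample))
for name, dataset in datasets.items():
    print(name, len(dataset.findall('./images/image')))
```

To get one file per tag, each element in the returned dictionary can be written with ET.ElementTree(dataset).write(name + '.xml', encoding='UTF-8', xml_declaration=True), assuming every tag name is also a valid file name. Because the input is read once rather than copied once per tag, this shape may suit a file with 350-400 tags and thousands of images better than a deep copy per tag, though that has not been measured here.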
