4

I need to transform large XML files that have a nested (hierarchical) structure of the form

<Root>
   Flat XML
   Hierarchical XML (multiple blocks, some repetitive)
   Flat XML
</Root>

into a flatter ("shredded") form, with 1 block for each repetitive nested block.

The data has numerous different tags and hierarchy variations (especially in the number of tags of the shredded XML before and after the hierarchical XML), so ideally no assumption should be made about tag and attribute names, or the hierarchical level.

A top-level view of the hierarchy for just 4 levels would look something like

<Level 1>
   ...
   <Level 2>
      ...
      <Level 3>
        ...
        <Level 4>A</Level 4>
        <Level 4>B</Level 4>
        ...
      </Level 3>
      ...
   </Level 2>
   ...
</Level 1>

and the desired output would then be

<Level 1>
  ...
  <Level 2>
    ...
      <Level 3>
        ...
        <Level 4>A</Level 4>
        ...
      </Level 3>
    ...
  </Level 2>
  ...
</Level 1>

<Level 1>
  ...
  <Level 2>
    ...
      <Level 3>
        ...
        <Level 4>B</Level 4>
        ...
      </Level 3>
    ...
  </Level 2>
  ...
</Level 1>

That is, if at each level i there are Li different components, a total of Product(Li) different components will be produced (just 2 above, since the only differentiating factor is Level 4, so L1*L2*L3*L4 = 2).

From what I have seen around, XSLT may be the way to go, but any other solution (e.g., StAX or even JDOM) would do.

A more detailed example, using fictitious information, would be

<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="US">
      <Comment>List of previous jobs in the US</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Senior Developer">
          <StartDate>01/10/2001</StartDate>
          <Months>38</Months>
        </Job>
        <Job title = "Senior Developer">
          <StartDate>01/12/2004</StartDate>
          <Months>6</Months>
        </Job>
        <Job title = "Senior Developer">
          <StartDate>01/06/2005</StartDate>
          <Months>10</Months>
        </Job>
      </JobDetails>
    </Employment>
  </EmploymentHistory>
  <EmploymentHistory>
    <Employment country="UK">
      <Comment>List of previous jobs in the UK</Comment>
      <Jobs>2</Jobs>
      <JobDetails>
        <Job title = "Junior Developer">
          <StartDate>01/05/1999</StartDate>
          <Months>25</Months>
        </Job>
        <Job title = "Junior Developer">
          <StartDate>01/07/2001</StartDate>
          <Months>3</Months>
        </Job>
      </JobDetails>
    </Employment>
  </EmploymentHistory>
  <Available>true</Available>
  <Experience unit="years">6</Experience>
</Employee>

The above data should be shredded into 5 blocks (i.e., one for each different <Job> block), each of which will leave all other tags identical and just have a single <Job> element. So, given the 5 different <Job> blocks in the above example, the transformed ("shredded") XML would be

<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="US">
      <Comment>List of previous jobs in the US</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Senior Developer">
          <StartDate>01/10/2001</StartDate>
          <Months>38</Months>
        </Job>
      </JobDetails>
      <Available>true</Available>
     <Experience unit="years">6</Experience>
    </Employment>
  </EmploymentHistory>
</Employee>

<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="US">
      <Comment>List of previous jobs in the US</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Senior Developer">
          <StartDate>01/12/2004</StartDate>
          <Months>6</Months>
        </Job>
      </JobDetails>
      <Available>true</Available>
     <Experience unit="years">6</Experience>
    </Employment>
  </EmploymentHistory>
</Employee>

<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="US">
      <Comment>List of previous jobs in the US</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Senior Developer">
          <StartDate>01/06/2005</StartDate>
          <Months>10</Months>
        </Job>
      </JobDetails>
      <Available>true</Available>
     <Experience unit="years">6</Experience>
    </Employment>
  </EmploymentHistory>
</Employee>

<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="UK">
      <Comment>List of previous jobs in the UK</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Junior Developer">
          <StartDate>01/05/1999</StartDate>
          <Months>25</Months>
        </Job>
      </JobDetails>
      <Available>true</Available>
     <Experience unit="years">6</Experience>
    </Employment>
  </EmploymentHistory>
</Employee>

<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="UK">
      <Comment>List of previous jobs in the UK</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Junior Developer">
          <StartDate>01/07/2001</StartDate>
          <Months>3</Months>
        </Job>
      </JobDetails>
      <Available>true</Available>
     <Experience unit="years">6</Experience>
    </Employment>
  </EmploymentHistory>
</Employee>
19
  • XSLT is ideal for this; just to understand the question a bit more, you want to repeat the Employee information for each <Job/> element? Also, where does <Available>true</Available> come from? Commented Dec 17, 2011 at 22:40
  • This is not "flattening". A lot of data seems simply deleted in the provided result -- only the first job-details in the first country is retained in the result. This contradicts your description of the wanted flattening. Please, edit the question and specify the complete result you want from the transformation. Commented Dec 17, 2011 at 22:43
  • @dash Yes, exactly that repetition. The idea is to create "records" that associate unique values, i.e. every repetitive block (in this case, <Job/>), will have to appear as if it were the only one in the file. The <Available/> and <Experience/> blocks follow <EmploymentHistory/> and are at the same hierarchy level as <Address/>, <Age/> and <EmploymentHistory/>. Commented Dec 17, 2011 at 22:47
  • @Dimitre Nothing is deleted, there is 1 block for every <Job/> block, I just didn't write all 5 "flattened" blocks to save screen space. Commented Dec 17, 2011 at 22:49
  • 1
    @PNS: Yes. Copy/paste should do it. Certainly, if you want to keep the text to a minimum, your example can contain just two or three such blocks. When the wanted result is complete, one can run their transformation and simply compare their result to the result provided in the question. Being unable to do this really discourages a potential answerer to post an answer. Commented Dec 18, 2011 at 0:22

2 Answers 2

4

Here is a generic solution as requested:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:param name="pLeafNodes" select="//Level-4"/>

 <xsl:template match="/">
  <t>
    <xsl:call-template name="StructRepro"/>
  </t>
 </xsl:template>

 <xsl:template name="StructRepro">
   <xsl:param name="pLeaves" select="$pLeafNodes"/>

   <xsl:for-each select="$pLeaves">
     <xsl:apply-templates mode="build" select="/*">
      <xsl:with-param name="pChild" select="."/>
      <xsl:with-param name="pLeaves" select="$pLeaves"/>
     </xsl:apply-templates>
   </xsl:for-each>
 </xsl:template>

  <xsl:template mode="build" match="node()|@*">
      <xsl:param name="pChild"/>
      <xsl:param name="pLeaves"/>

     <xsl:copy>
       <xsl:apply-templates mode="build" select="@*"/>

       <xsl:variable name="vLeafChild" select=
         "*[count(.|$pChild) = count($pChild)]"/>

       <xsl:choose>
        <xsl:when test="$vLeafChild">
         <xsl:apply-templates mode="build"
             select="$vLeafChild
                    |
                      node()[not(count(.|$pLeaves) = count($pLeaves))]">
             <xsl:with-param name="pChild" select="$pChild"/>
             <xsl:with-param name="pLeaves" select="$pLeaves"/>
         </xsl:apply-templates>
        </xsl:when>
        <xsl:otherwise>
         <xsl:apply-templates mode="build" select=
         "node()[not(.//*[count(.|$pLeaves) = count($pLeaves)])
                or
                 .//*[count(.|$pChild) = count($pChild)]
                ]
         ">

             <xsl:with-param name="pChild" select="$pChild"/>
             <xsl:with-param name="pLeaves" select="$pLeaves"/>
         </xsl:apply-templates>
        </xsl:otherwise>
       </xsl:choose>
     </xsl:copy>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When applied on the provided simplified (and generic) XML document:

<Level-1>
   ...
   <Level-2>
      ...
      <Level-3>
        ...
        <Level-4>A</Level-4>
        <Level-4>B</Level-4>
        ...
      </Level-3>
      ...
   </Level-2>
   ...
</Level-1>

the wanted, correct result is produced:

<Level-1>
   ...
   <Level-2>
      ...
      <Level-3>
         <Level-4>A</Level-4>
      </Level-3>
      ...
   </Level-2>
   ...
</Level-1>
<Level-1>
   ...
   <Level-2>
      ...
      <Level-3>
         <Level-4>B</Level-4>
      </Level-3>
      ...
   </Level-2>
   ...
</Level-1>

Now, if we change the line:

 <xsl:param name="pLeafNodes" select="//Level-4"/>

to:

 <xsl:param name="pLeafNodes" select="//Job"/>

and apply the transformation to the Employee XML document:

<Employee name="A Name">
    <Address>123 A Street</Address>
    <Age>28</Age>
    <EmploymentHistory>
        <Employment country="US">
            <Comment>List of previous jobs in the US</Comment>
            <Jobs>3</Jobs>
            <JobDetails>
                <Job title = "Senior Developer">
                    <StartDate>01/10/2001</StartDate>
                    <Months>38</Months>
                </Job>
                <Job title = "Senior Developer">
                    <StartDate>01/12/2004</StartDate>
                    <Months>6</Months>
                </Job>
                <Job title = "Senior Developer">
                    <StartDate>01/06/2005</StartDate>
                    <Months>10</Months>
                </Job>
            </JobDetails>
        </Employment>
    </EmploymentHistory>
    <EmploymentHistory>
        <Employment country="UK">
            <Comment>List of previous jobs in the UK</Comment>
            <Jobs>2</Jobs>
            <JobDetails>
                <Job title = "Junior Developer">
                    <StartDate>01/05/1999</StartDate>
                    <Months>25</Months>
                </Job>
                <Job title = "Junior Developer">
                    <StartDate>01/07/2001</StartDate>
                    <Months>3</Months>
                </Job>
            </JobDetails>
        </Employment>
    </EmploymentHistory>
    <Available>true</Available>
    <Experience unit="years">6</Experience>
</Employee>

we again get the wanted, correct result:

<t>
   <Employee name="A Name">
      <Address>123 A Street</Address>
      <Age>28</Age>
      <EmploymentHistory>
         <Employment country="US">
            <Comment>List of previous jobs in the US</Comment>
            <Jobs>3</Jobs>
            <JobDetails>
               <Job title="Senior Developer">
                  <StartDate>01/10/2001</StartDate>
                  <Months>38</Months>
               </Job>
            </JobDetails>
         </Employment>
      </EmploymentHistory>
      <Available>true</Available>
      <Experience unit="years">6</Experience>
   </Employee>
   <Employee name="A Name">
      <Address>123 A Street</Address>
      <Age>28</Age>
      <EmploymentHistory>
         <Employment country="US">
            <Comment>List of previous jobs in the US</Comment>
            <Jobs>3</Jobs>
            <JobDetails>
               <Job title="Senior Developer">
                  <StartDate>01/12/2004</StartDate>
                  <Months>6</Months>
               </Job>
            </JobDetails>
         </Employment>
      </EmploymentHistory>
      <Available>true</Available>
      <Experience unit="years">6</Experience>
   </Employee>
   <Employee name="A Name">
      <Address>123 A Street</Address>
      <Age>28</Age>
      <EmploymentHistory>
         <Employment country="US">
            <Comment>List of previous jobs in the US</Comment>
            <Jobs>3</Jobs>
            <JobDetails>
               <Job title="Senior Developer">
                  <StartDate>01/06/2005</StartDate>
                  <Months>10</Months>
               </Job>
            </JobDetails>
         </Employment>
      </EmploymentHistory>
      <Available>true</Available>
      <Experience unit="years">6</Experience>
   </Employee>
   <Employee name="A Name">
      <Address>123 A Street</Address>
      <Age>28</Age>
      <EmploymentHistory>
         <Employment country="UK">
            <Comment>List of previous jobs in the UK</Comment>
            <Jobs>2</Jobs>
            <JobDetails>
               <Job title="Junior Developer">
                  <StartDate>01/05/1999</StartDate>
                  <Months>25</Months>
               </Job>
            </JobDetails>
         </Employment>
      </EmploymentHistory>
      <Available>true</Available>
      <Experience unit="years">6</Experience>
   </Employee>
   <Employee name="A Name">
      <Address>123 A Street</Address>
      <Age>28</Age>
      <EmploymentHistory>
         <Employment country="UK">
            <Comment>List of previous jobs in the UK</Comment>
            <Jobs>2</Jobs>
            <JobDetails>
               <Job title="Junior Developer">
                  <StartDate>01/07/2001</StartDate>
                  <Months>3</Months>
               </Job>
            </JobDetails>
         </Employment>
      </EmploymentHistory>
      <Available>true</Available>
      <Experience unit="years">6</Experience>
   </Employee>
</t>

Explanation: The processing is done in a named template (StructRepro) and controlled by a single external parameter named pLeafNodes, that must contain a nodeset of all nodes whose "upward structure" is to be reproduced in the result.

Sign up to request clarification or add additional context in comments.

14 Comments

Excellent and concise! Is there any way of avoiding putting down the "Level-4" tag at all? That would make it 100% generic. :-)
@PNS:Yes, we can define this code as a template (named or moded) and the parameter it uses can be a node-set, containing all nodes that must be processed as "leaf: nodes. I am also trying to find a better word to describe this process. "Sred" is obviously the opposite of what happens here (and at the root of my so long confusion). In fact, what we have is "structure reproduction" -- something like reproduction by cell division.
@PNS: I edited my answer to provided the fully generic solution you were looking for. Enjoy. :)
@PNS: Are you saying that you do not get the same output that I am getting? If so, the reason can be that either you have modified the source XML or you have modified the XSLT code, or your XSLT processor may be buggy. I have run this transformation with many different XSLT processors and with all of them I get the result that is presented in my answer.
@PNS: With all of the following XSLT processors I get (and you or anybody else can run the transformation and confirm this) the same result: MSXML3, MSXML4, MSXML6, .NET XslCompiledTransform, .NET XslTransform, Saxon 6.5.4, Saxon 9.1.07, XQSharp.
|
3

Given the following XML:

<?xml version="1.0" encoding="utf-8" ?>
<Employee name="A Name">
  <Address>123 A Street</Address>
  <Age>28</Age>
  <EmploymentHistory>
    <Employment country="US">
      <Comment>List of previous jobs in the US</Comment>
      <Jobs>3</Jobs>
      <JobDetails>
        <Job title = "Developer">
          <StartDate>01/10/2001</StartDate>
          <Months>38</Months>
        </Job>
        <Job title = "Developer">
          <StartDate>01/12/2004</StartDate>
          <Months>6</Months>
        </Job>
        <Job title = "Developer">
          <StartDate>01/06/2005</StartDate>
          <Months>10</Months>
        </Job>
      </JobDetails>
      </Employment>
      <Employment country="UK">
        <Comment>List of previous jobs in the UK</Comment>
        <Jobs>2</Jobs>
        <JobDetails>
          <Job title = "Developer">
            <StartDate>01/05/1999</StartDate>
            <Months>25</Months>
          </Job>
          <Job title = "Developer">
            <StartDate>01/07/2001</StartDate>
            <Months>3</Months>
          </Job>
        </JobDetails>
        </Employment>
  </EmploymentHistory>
  <Available>true</Available>
  <Experience unit="years">6</Experience>
</Employee>

The following XSLT:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
      <Output>
        <xsl:apply-templates select="//Employee/EmploymentHistory/Employment/JobDetails/Job" />
      </Output>
    </xsl:template>

  <xsl:template match="//Employee/EmploymentHistory/Employment/JobDetails/Job">
    <Employee>
      <xsl:attribute name="name">
        <xsl:value-of select="ancestor::Employee/@name"/>
      </xsl:attribute>
      <Address>
        <xsl:value-of select="ancestor::Employee/Address"/>
      </Address>
      <Age>
        <xsl:value-of select="ancestor::Employee/Age"/>
      </Age>
      <EmploymentHistory>
        <Employment>
          <xsl:attribute name="country">
            <xsl:value-of select="ancestor::Employment/@country"/>
          </xsl:attribute>
          <Comment>
            <xsl:value-of select="ancestor::Employment/Comment"/>
          </Comment>
          <Jobs>
            <xsl:value-of select="ancestor::Employment/Jobs"/>
          </Jobs>
          <JobDetails>
            <xsl:copy-of select="."/>
          </JobDetails>
          <Available>
            <xsl:value-of select="ancestor::Employee/Available"/>
          </Available>
          <Experience>
            <xsl:attribute name="unit">
              <xsl:value-of select="ancestor::Employee/Experience/@unit"/>
            </xsl:attribute>
            <xsl:value-of select="ancestor::Employee/Experience"/>
          </Experience>
        </Employment>
      </EmploymentHistory>
    </Employee>

  </xsl:template>


</xsl:stylesheet>

Gives the following output:

<?xml version="1.0" encoding="utf-8"?>
<Output>
  <Employee name="A Name">
    <Address>123 A Street</Address>
    <Age>28</Age>
    <EmploymentHistory>
      <Employment country="US">
        <Comment>List of previous jobs in the US</Comment>
        <Jobs>3</Jobs>
        <JobDetails>
          <Job title="Developer">
          <StartDate>01/10/2001</StartDate>
          <Months>38</Months>
        </Job>
        </JobDetails>
        <Available>true</Available>
        <Experience unit="years">6</Experience>
      </Employment>
    </EmploymentHistory>
  </Employee>
  <Employee name="A Name">
    <Address>123 A Street</Address>
    <Age>28</Age>
    <EmploymentHistory>
      <Employment country="US">
        <Comment>List of previous jobs in the US</Comment>
        <Jobs>3</Jobs>
        <JobDetails>
          <Job title="Developer">
          <StartDate>01/12/2004</StartDate>
          <Months>6</Months>
        </Job>
        </JobDetails>
        <Available>true</Available>
        <Experience unit="years">6</Experience>
      </Employment>
    </EmploymentHistory>
  </Employee>
  <Employee name="A Name">
    <Address>123 A Street</Address>
    <Age>28</Age>
    <EmploymentHistory>
      <Employment country="US">
        <Comment>List of previous jobs in the US</Comment>
        <Jobs>3</Jobs>
        <JobDetails>
          <Job title="Developer">
          <StartDate>01/06/2005</StartDate>
          <Months>10</Months>
        </Job>
        </JobDetails>
        <Available>true</Available>
        <Experience unit="years">6</Experience>
      </Employment>
    </EmploymentHistory>
  </Employee>
  <Employee name="A Name">
    <Address>123 A Street</Address>
    <Age>28</Age>
    <EmploymentHistory>
      <Employment country="UK">
        <Comment>List of previous jobs in the UK</Comment>
        <Jobs>2</Jobs>
        <JobDetails>
          <Job title="Developer">
            <StartDate>01/05/1999</StartDate>
            <Months>25</Months>
          </Job>
        </JobDetails>
        <Available>true</Available>
        <Experience unit="years">6</Experience>
      </Employment>
    </EmploymentHistory>
  </Employee>
  <Employee name="A Name">
    <Address>123 A Street</Address>
    <Age>28</Age>
    <EmploymentHistory>
      <Employment country="UK">
        <Comment>List of previous jobs in the UK</Comment>
        <Jobs>2</Jobs>
        <JobDetails>
          <Job title="Developer">
            <StartDate>01/07/2001</StartDate>
            <Months>3</Months>
          </Job>
        </JobDetails>
        <Available>true</Available>
        <Experience unit="years">6</Experience>
      </Employment>
    </EmploymentHistory>
  </Employee>
</Output>

Note that I've added an Output root element to ensure the document is well formed.

Is this what you wanted?

You might also be able to use xsl:copy to copy the higher level elements, but I need to think about this one a bit more. With the above xslt, you have more control, but also you have to redefine your elements...

16 Comments

I am actually looking for generic XSLT code (like in the answer to stackoverflow.com/questions/1900184/…), if possible. But, even if that can't be done, I am grateful for the help anyway (and testing it now)! :-)
The problem with that XSLT is that it is doing exactly what Dimitre is talking about - it's flattening a hierarchy. What you actually want to do is repeat all of JobDetails ancestors, but exclude all of the sibling JobDetails. An additional problem is the Employment country="" subtree - this makes it harder to establish which elements you don't want to keep. If that makes sense?
The flattening procedure is very deterministic: every element (in this case, <Job/>) that is repeated, produces a new full block with all the other information intact and only itself being the difference. In general, this applies to all hierarchy levels. If, say, there were M elements repeated in a level above <Job/> and below each of these M elements there were N <Job/> different elements, then a total of MxN <Employee> blocks would be produced. Imagine needing to put this in a database, where each row would be a different <Employee> block "path" every time. Essentially, it is XML shredding.
I get that :-) The problem is that a generic solution would probably involve xsl:copy-of and excluding all of the other siblings is difficult. This actually might be easier in DOM processing!
@PNS: Please, clarify your last comment above and include it in the question, giving a complete and correct wanted result. If the operation has nothing to do with "flattening" call it with a more appropriate, meaningful and not confusing name.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.