I'm doing an HTML-to-XML transformation with XSLT, and I'm finding it difficult to transform HTML tables with merged cells into XML.

Here is the scenario. My input HTML table:

<table>
    <thead>
        <tr>
            <td rowspan="3">Date</td>
            <td colspan="5">Customer Price Index</td>
            <td rowspan="3"> private consumption chain price </td>
            <td colspan="2"> Other consumer price mesure </td>
        </tr>
        <tr>
            <td rowspan="2"> All groups </td>
            <td rowspan="2"> Excluding volatile items </td>
            <td colspan="3">Market prices excluding volatile items</td>
            <td colspan="2"> Based on seasonally adjusted quntity price changers </td>
        </tr>
        <tr>
            <td>Goods</td>
            <td>Services</td>
            <td>Total</td>
            <td> weihgted median </td>
            <td>Trimmed mean</td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>2003/04</td>
            <td colspan="8">content</td>
        </tr>
        <tr>
            <td>Dec</td>
            <td>2.4</td>
            <td>2.4</td>
            <td>1.6</td>
            <td>2.2</td>
            <td>1.8</td>
            <td>1.0</td>
            <td>2.0</td>
            <td>2.5</td>
        </tr>
    </tbody>
</table>

Desired XML output:

  <table>
        <thead>
            <row>
                <data namest="1" morerows="2">
                    <p>Date</p>
                </data>
                <data namest="2" nameend="6">
                    <p>Consumer price index</p>
                </data>
                <data namest="7" morerows="2">
                    <p>Private consumption chain price index</p>
                </data>
                <data namest="8" nameend="9">
                    <p>Other consumer price mesure</p>
                </data>
            </row>
            <row>
                <data namest="2" morerows="1">
                    <p>All groups</p>
                </data>
                <data namest="3" morerows="1">
                    <p>Excluding volatile items</p>
                </data>
                <data namest="4" nameend="6">
                    <p>Market prices excluding volatile items</p>
                </data>
                <data namest="8" nameend="9">
                    <p>Based on seasonally adjusted quntity price changers</p>
                </data>
            </row>
            <row>
                <data namest="4">
                    <p>Goods</p>
                </data>
                <data namest="5">
                    <p>Services</p>
                </data>
                <data namest="6">
                    <p>Total</p>
                </data>
                <data namest="8">
                    <p>Weighted median</p>
                </data>
                <data namest="9">
                    <p>Trimmed mean</p>
                </data>
            </row>
        </thead>
        <tbody>
            <row>
                <data namest="1">
                    <p>2003/04</p>
                </data>
                <data namest="2" nameend="9">
                    <p>content</p>
                </data>
            </row>
            <row>
                <data namest="1">
                    <p>Dec</p>
                </data>
                <data namest="2">
                    <p>2.4</p>
                </data>
                <data namest="3">
                    <p>2.4</p>
                </data>
                <data namest="4">
                    <p>1.6</p>
                </data>
                <data namest="5">
                    <p>2.2</p>
                </data>
                <data namest="6">
                    <p>1.8</p>
                </data>
                <data namest="7">
                    <p>1.0</p>
                </data>
                <data namest="8">
                    <p>2.0</p>
                </data>
                <data namest="9">
                    <p>2.5</p>
                </data>
            </row>
        </tbody>
</table>

As you can see, vertical cell merging is represented by the rowspan attribute and horizontal merging by the colspan attribute in the input HTML.

In the expected output, the namest attribute gives the cell's starting column number, the morerows attribute gives how many additional rows the cell spans downwards (vertical merge), and the nameend attribute gives the cell's last column number (horizontal merge).

This could be solved in other languages using data structures (two-dimensional arrays), but I'm struggling to find an effective way to do it in XSLT.

I wrote the following XSLT for this task. It works for the first row, but for the other rows this method becomes too complicated.

 <xsl:template match="td[parent::tr[not(preceding::tr)]]">

        <xsl:variable name="pre_rowspan" select="number(format-number(count(preceding-sibling::td[@rowspan])+1, '#0', 'myformat'))"/>
        <xsl:variable name="pre_colspan" select="number(format-number(preceding-sibling::td[@colspan]/@colspan, '#0', 'myformat'))"/>
        <xsl:variable name="numberof_pre_rowspan" select="number(format-number(count(preceding-sibling::td[@rowspan])+1, '#0', 'myformat'))"/>

        <data>
            <xsl:attribute name="namest" select="number($pre_rowspan + $pre_colspan)"/>
            <xsl:if test="@rowspan">
                <xsl:attribute name="morerows" select="number(@rowspan)-1"/>
            </xsl:if>
            <xsl:if test="@colspan">
                <xsl:attribute name="nameend" select="number(@colspan)+number(format-number(count(preceding-sibling::td[@rowspan]), '#0', 'myformat'))+number(format-number(number(preceding-sibling::td[@colspan]/@colspan), '#0', 'myformat'))"/>
            </xsl:if>
            <xsl:if test="@rowspan and @colspan">
                <xsl:attribute name="nameend" select="$pre_rowspan"/>
            </xsl:if>

            <p>
                <xsl:apply-templates/>
            </p>

        </data>
    </xsl:template>

So, can anyone suggest how I can do this task using XSLT (using a data structure or any other method)?

1 Answer

Yes, it's quite a tough one, and I'm only going to sketch an approach.

It's going to involve sibling recursion, first through the sibling td's within a tr, and then through the sibling tr's. As you move through the recursion, I think you need to pass a data structure representing which cells are occupied, and I would suggest doing this as a sequence of strings, e.g. ("XXX", "X-X", "--X") indicates that the first three cells in row 1 are occupied, the 1st and 3rd cells in row 2 are occupied, and so on.
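
To illustrate what sibling recursion looks like, here is a minimal sketch that walks the td's of a single row and passes the next free column along as a parameter. It is deliberately simplified: it handles colspan only and ignores rowspan and the occupancy table, so it gives the right columns only for rows that no cell from an earlier row reaches into. The mode name "walk" is my own choice, not something from the question.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Start the recursion at the first cell of each row -->
  <xsl:template match="tr">
    <row>
      <xsl:apply-templates select="td[1]" mode="walk">
        <xsl:with-param name="firstFreeCol" select="1"/>
      </xsl:apply-templates>
    </row>
  </xsl:template>

  <!-- Handle one cell, then hand the updated column on to the next sibling -->
  <xsl:template match="td" mode="walk">
    <xsl:param name="firstFreeCol" as="xs:integer"/>
    <!-- A missing colspan attribute counts as 1 -->
    <xsl:variable name="colspan" as="xs:integer"
                  select="xs:integer((@colspan, 1)[1])"/>
    <data namest="{$firstFreeCol}">
      <xsl:if test="$colspan gt 1">
        <xsl:attribute name="nameend" select="$firstFreeCol + $colspan - 1"/>
      </xsl:if>
      <p><xsl:value-of select="normalize-space(.)"/></p>
    </data>
    <!-- The recursion ends by itself when there is no following td -->
    <xsl:apply-templates select="following-sibling::td[1]" mode="walk">
      <xsl:with-param name="firstFreeCol" select="$firstFreeCol + $colspan"/>
    </xsl:apply-templates>
  </xsl:template>

</xsl:stylesheet>
```

For the first body row of the question's table this produces namest="1" for "2003/04" and namest="2" nameend="9" for "content". Handling rowspan means passing the occupancy table along in the same way as $firstFreeCol.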

If I understand rowspan and colspan correctly, the rule is that for a td in the Nth tr, it will always have a start-row of N, and it will occupy the first available column such that all the required cells are free, provided this is to the right of all previous cells starting in the same row.

So I'd suggest that when your recursion gets to a particular td, you pass three parameters: the row number $row, the first free column in that row $firstFreeCol, and the occupancy table, a sequence of strings as above. Given these three values, plus the values of rowspan and colspan, you then test whether (in the occupancy table) every row between $row and ($row + @rowspan - 1) has every column between $firstFreeCol and ($firstFreeCol + @colspan - 1) free. If not, repeat with $firstFreeCol set to $firstFreeCol + 1. If it is free, output this cell with its allocated coordinates, and proceed to the next one with $firstFreeCol set to $firstFreeCol + @colspan and with the occupancy table updated to set the occupied cells to "X"s.
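
The "test, and if not free, move one column to the right" step could be sketched as two functions. The names f:region-is-free and f:find-start-col, and the namespace URI bound to the f prefix, are placeholders of mine. The sketch assumes that any position beyond the end of a row string, or any row beyond the end of the sequence, counts as free.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/table-functions"
    exclude-result-prefixes="xs f">

  <!-- Hypothetical helper: true if no cell in the rectangle is marked 'X' -->
  <xsl:function name="f:region-is-free" as="xs:boolean">
    <xsl:param name="occupancy" as="xs:string*"/>
    <xsl:param name="row" as="xs:integer"/>
    <xsl:param name="col" as="xs:integer"/>
    <xsl:param name="rowspan" as="xs:integer"/>
    <xsl:param name="colspan" as="xs:integer"/>
    <!-- substring() returns '' for a missing row or column, which is not 'X' -->
    <xsl:sequence select="
      every $r in $row to $row + $rowspan - 1,
            $c in $col to $col + $colspan - 1
      satisfies not(substring($occupancy[$r], $c, 1) = 'X')"/>
  </xsl:function>

  <!-- Hypothetical helper: first column at or after $firstFreeCol where the cell fits -->
  <xsl:function name="f:find-start-col" as="xs:integer">
    <xsl:param name="occupancy" as="xs:string*"/>
    <xsl:param name="row" as="xs:integer"/>
    <xsl:param name="firstFreeCol" as="xs:integer"/>
    <xsl:param name="rowspan" as="xs:integer"/>
    <xsl:param name="colspan" as="xs:integer"/>
    <!-- Terminates because columns past the end of every row string are free -->
    <xsl:sequence select="
      if (f:region-is-free($occupancy, $row, $firstFreeCol, $rowspan, $colspan))
      then $firstFreeCol
      else f:find-start-col($occupancy, $row, $firstFreeCol + 1, $rowspan, $colspan)"/>
  </xsl:function>

</xsl:stylesheet>
```

After the first header row of the question's table has been processed, the occupancy table would be ('XXXXXXXXX', 'X-----X', 'X-----X'). For the "All groups" cell (row 2, rowspan 2, colspan 1), f:find-start-col($occupancy, 2, 1, 2, 1) skips the occupied column 1 and returns 2, which matches the namest="2" in the desired output. In a template the spans can be read with xs:integer((@rowspan, 1)[1]) and xs:integer((@colspan, 1)[1]), so that a missing attribute counts as 1.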

I don't know how familiar you are with the use of recursion to achieve this kind of effect. In my book, I did the example of the "Knight's Tour" with the express aim of illustrating that complex algorithms like this are entirely possible in XSLT, though if you're new to functional programming then it takes a while for this to come naturally. The Knight's Tour also has a similar need to be inventive with data structures given the limited set of facilities available (it all gets easier with maps and arrays in XSLT 3.0...). You'll need a range of utility functions, for example here's an (untested) function that marks a particular cell in the occupancy table as occupied, and returns a new occupancy table:

<xsl:function name="f:set-occupied-cell" as="xs:string*">
  <xsl:param name="occupancy" as="xs:string*"/>
  <xsl:param name="row" as="xs:integer"/>
  <xsl:param name="col" as="xs:integer"/>
  <!-- Rows before the target row: copied unchanged, padded with empty rows if the table is too short -->
  <xsl:sequence select="
     for $i in 1 to $row - 1
     return if ($i gt count($occupancy)) then '' else $occupancy[$i]"/>
  <!-- The target row with position $col set to 'X', padded with '-' if the row is too short -->
  <xsl:variable name="target-row" select="$occupancy[$row]"/>
  <xsl:sequence select="concat(
    string-join(
     for $i in 1 to $col - 1
     return if ($i gt string-length($target-row))
            then '-'
            else substring($target-row, $i, 1), ''),
    'X',
    substring($target-row, $col + 1))"/>
  <!-- Rows after the target row: copied unchanged -->
  <xsl:sequence select="subsequence($occupancy, $row + 1)"/>
</xsl:function>
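
A cell with rowspan or colspan covers a whole rectangle, so one more utility is needed to mark every cell in it. The following is a sketch only, assuming f:set-occupied-cell behaves as described above and lives in the same stylesheet with the same f prefix. The names f:set-occupied-region and f:mark-cells are hypothetical.

```xml
<!-- Hypothetical helper: marks a rowspan x colspan rectangle whose top-left cell is ($row, $col) -->
<xsl:function name="f:set-occupied-region" as="xs:string*">
  <xsl:param name="occupancy" as="xs:string*"/>
  <xsl:param name="row" as="xs:integer"/>
  <xsl:param name="col" as="xs:integer"/>
  <xsl:param name="rowspan" as="xs:integer"/>
  <xsl:param name="colspan" as="xs:integer"/>
  <xsl:sequence select="f:mark-cells($occupancy, $row, $col, $colspan, 0, $rowspan * $colspan)"/>
</xsl:function>

<!-- Recursion over the cell index $i (0-based, row by row) until $n cells are marked -->
<xsl:function name="f:mark-cells" as="xs:string*">
  <xsl:param name="occupancy" as="xs:string*"/>
  <xsl:param name="row" as="xs:integer"/>
  <xsl:param name="col" as="xs:integer"/>
  <xsl:param name="colspan" as="xs:integer"/>
  <xsl:param name="i" as="xs:integer"/>
  <xsl:param name="n" as="xs:integer"/>
  <!-- Each step passes a new occupancy table to the next one; nothing is modified in place -->
  <xsl:sequence select="
    if ($i ge $n) then $occupancy
    else f:mark-cells(
           f:set-occupied-cell($occupancy, $row + $i idiv $colspan, $col + $i mod $colspan),
           $row, $col, $colspan, $i + 1, $n)"/>
</xsl:function>
```

Because variables cannot be reassigned in XSLT, the loop over the cells is written as a recursive function that takes the table produced by the previous step as its input.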

3 Comments

Thanks for the answer. Even though you explained your approach at length, it's hard for me to understand how the method actually works, since I'm not very familiar with functional programming in XSLT (e.g. how can I identify the first free column in a row, $firstFreeCol?).
Yes, it's a challenge. But a fun challenge if you're up for it!
You might find it worthwhile studying Wendell Piez's stylesheets for manipulating CALS and OASIS tables at github.com/wendellpiez/JATSPreviewStylesheets/tree/master/xslt/… - there could be some reusable code there, or you might just find the design approach stimulating
