Extract CDATA using XSLT

Question

Below is the XML, which contains a CDATA section:

<?xml version="1.0" encoding="ISO-8859-1"?>
<character>
<name>
<role>Indiana Jones</role>
<actor>Harrison Ford</actor>
<part>protagonist</part>
<![CDATA[  <film>Indiana Jones and the Kingdom of the Crystal Skull</film>]]>
</name>
</character>

For the above XML, I need to strip the CDATA wrapper and add a new element after the existing "film" element, so the final output will be:

<?xml version="1.0" encoding="ISO-8859-1"?>
<character>
<name>
<role>Indiana Jones</role>
<actor>Harrison Ford</actor>
<part>protagonist</part>
<film>Indiana Jones and the Kingdom of the Crystal Skull</film>
<Language>English</Language>
</name>
</character>

Can this be done using XSLT?

Where does <Language>English</Language> come from in the output? Perhaps it was supposed to be part of the input?
– james.garriss, Commented Aug 20, 2012 at 15:28

james.garriss · Accepted Answer · 2012-08-20 15:30:30Z

A slightly modified identity transform should work.

Given this XML:

<?xml version="1.0" encoding="ISO-8859-1"?>
<character>
    <name>
        <role>Indiana Jones</role>
        <actor>Harrison Ford</actor>
        <part>protagonist</part>
        <![CDATA[  <film>Indiana Jones and the Kingdom of the Crystal Skull</film>]]>
    </name>
</character>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Will produce this output:

<?xml version="1.0" encoding="UTF-8"?>
<character>
   <name>
      <role>Indiana Jones</role>
      <actor>Harrison Ford</actor>
      <part>protagonist</part>
          <film>Indiana Jones and the Kingdom of the Crystal Skull</film>
    </name>
</character>

(Tested using Saxon-HE 9.3.0.5 in oXygen 12.2.)

Mads Hansen · Accepted Answer · 2010-09-14 11:03:47Z

2

Since the film element in the CDATA block appears to be well-formed, you could use disable-output-escaping (DOE). Match name/text(), output it with xsl:value-of and DOE enabled, and then insert the Language element immediately after it.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"  />

<!--Identity template simply copies content forward -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>


<xsl:template match="name/text()">
    <!--disable-output-escaping will prevent the "film" element from being escaped.
    Since it appears to be well-formed you should be safe, but no guarantees -->
    <xsl:value-of select="." disable-output-escaping="yes" />
    <Language>English</Language>
</xsl:template>

</xsl:stylesheet>

answered Sep 14, 2010 at 11:03


1 Comment

user357812 Over a year ago

+1 if DOE is possible and there is strong certainty that the CDATA content is well-formed.

Per T · Accepted Answer · 2010-09-14 11:15:32Z

1

Another way to solve this, which would give you more control over the transformation, is to use Andrew Welch's LexEv XMLReader. Among other things, it lets you process CDATA sections as markup.

answered Sep 14, 2010 at 11:15


1 Comment

LarsH Over a year ago

+1 Interesting solution. Note @Madhu that this is not an XSLT solution but works by supplying a different XML parser to the XSLT processor. May require a Java XSLT processor. If you have control over your XSLT environment enough to use this, it will take care of your parsing problems in a very complete way.

LarsH · Accepted Answer · 2010-09-14 15:37:03Z

0

First, the fact that your input XML has "CDATA" is in one sense irrelevant... the XSLT can't tell whether it's CDATA or not. What's key about your input XML is that you have escaped markup <film>...</film>, and you want to turn it into a real element.
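To illustrate the point, the two fragments below differ only in how they are written. Assuming a conforming XML parser, both give the XSLT processor the same single text node under name, with the string value `<film>Indiana Jones</film>`. The shortened film title is just for illustration.

```xml
<!-- Variant 1: the markup is hidden inside a CDATA section -->
<name><![CDATA[<film>Indiana Jones</film>]]></name>

<!-- Variant 2: the same characters written with entity escapes -->
<name>&lt;film&gt;Indiana Jones&lt;/film&gt;</name>
```

A template matching name/text() fires identically for both variants, which is why the stylesheet has to work on the text content rather than on "the CDATA".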

If you know that the escaped element will always have a certain name ('film'), and you know where it occurs, you can strip it and replace it easily:

   <xsl:template match="text()[contains(., '&lt;film>')]">
      <film>
         <xsl:value-of select="substring-before(substring-after(., '&lt;film>'),
              '&lt;/film>')"/>
      </film>
   </xsl:template>

If you don't know in advance where the escaped tags will occur and what the element names are, you could use XSLT 2.0's <xsl:analyze-string> to find and replace them. But as Alejandro pointed out, general parsing of XML using regular expressions can get very messy. It would only be feasible if you know the markup will be simple.
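A minimal sketch of the xsl:analyze-string idea, assuming an XSLT 2.0 processor and assuming the escaped markup is limited to simple elements with no attributes, no namespace prefixes, and no nested markup. The regular expression is my own illustration, not a general XML parser; anything it does not recognise is passed through as plain text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <!-- Identity template: copy everything that is not handled below -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Text nodes containing an escaped "<" are scanned for simple elements -->
    <xsl:template match="text()[contains(., '&lt;')]">
        <!-- Pattern: start tag, text without markup, matching end tag.
             \i\c* is the XPath regex shorthand for an XML name;
             \1 is a back-reference to the captured element name. -->
        <xsl:analyze-string select="."
            regex="&lt;(\i\c*)&gt;([^&lt;]*)&lt;/\1&gt;">
            <xsl:matching-substring>
                <!-- Rebuild the escaped element as a real element -->
                <xsl:element name="{regex-group(1)}">
                    <xsl:value-of select="regex-group(2)"/>
                </xsl:element>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <!-- Everything else stays as ordinary text -->
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

</xsl:stylesheet>
```

Because the pattern excludes "<" inside the element content, nested or attributed elements simply fail to match and remain escaped text, which is the limitation mentioned above.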

edited Sep 14, 2010 at 15:37

answered Sep 14, 2010 at 11:16


7 Comments

Mads Hansen Over a year ago

+1 a little more exact, in case there are multiple name/text(). Good defensive coding

Madhu CM Over a year ago

rather you can add <xsl:value-of disable-output-escaping="yes" select="substring-after(.,'<?xml version="1.0" encoding="utf-8"?>')" />

Dimitre Novatchev Over a year ago

+1 for the good explanation. One must mention that the general case requires a function like saxon:parse() -- probably we will soon have a standard one in the next version of F&O.

LarsH Over a year ago

@Madhu No, that won't work because the XSLT doesn't see <?xml version=...>. It's not part of the source document tree. Even if it were, taking value-of . (which I assume to be /) would lose all the elements of the document: their tags would be absent from the output. Also, a big reason for the above is to avoid disable-output-escaping, which is a kludge that is usually avoidable if you treat markup as markup and text as text. XSLT processors aren't even required to honor d-o-e. In some environments, they can't.

user357812 Over a year ago

@LarsH: I think you should test for contains(.,'lt;film>'). Also, I don't think it is good practice to recommend parsing XML with regular expressions...
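Following up on the saxon:parse() comment above: XPath 3.0 standardised fn:parse-xml and fn:parse-xml-fragment, so with an XSLT 3.0 processor the general case might be handled roughly as below. This is a sketch under that assumption. The match pattern and the Language element are taken from the earlier answers, and the function raises a dynamic error if the text is not a well-formed fragment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <!-- XSLT 3.0 shorthand for the identity template -->
    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Only the text node that carries escaped markup is parsed -->
    <xsl:template match="name/text()[contains(., '&lt;')]">
        <!-- parse-xml-fragment returns a document node; copy-of copies its children -->
        <xsl:copy-of select="parse-xml-fragment(.)"/>
        <Language>English</Language>
    </xsl:template>

</xsl:stylesheet>
```

Unlike disable-output-escaping, this produces real nodes in the result tree, so it does not depend on how the output is serialized.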


Neeku · Accepted Answer · 2013-11-11 13:59:01Z

I was dealing with something similar and I found a good solution so I thought of sharing it with you, but this one is for NSXMLParser.

If you're using NSXMLParser there's a delegate method called foundCDATA which can look like this:

- (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock{
    if (!parseElement) {
        return;
    }
    if (parsedElementData==nil) {
        parsedElementData = [[NSMutableData alloc] init];
    }
    [parsedElementData appendData:CDATABlock];

    //Grabs the whole content in CDATABlock.
    NSMutableString *content = [[NSMutableString alloc] initWithData:CDATABlock encoding:NSUTF8StringEncoding];

 }

Now add this prewritten class to your project. Then import it to the parser class you want to use it in:

#import "NSString_stripHTML.h"

Now you can simply add the following lines to the foundCDATA method:

NSString *strippedContent;
strippedContent = [content strippedHtml];

Now you will have the stripped text without any extra characters. You can substring whatever you want from this stripped text.
