I am trying to transform an XHTML web page with XSLT by extracting some of its parts. For example, I'd like to extract the HEAD and BODY parts separately (this is only the first step; the next will be extracting some divs) and use them in my output XHTML document. Here is the XSLT code:

<xsl:stylesheet version="2.0"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xhtml xsl xs">

<xsl:output
  method="html"
  omit-xml-declaration="yes"
  doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
  doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  indent="yes"/>


<xsl:template match="/">
  <HTML>
      <xsl:apply-templates/>
  </HTML>
</xsl:template>

<xsl:template match="xhtml:HTML/xhtml:BODY">
 <xsl:copy-of select="." disable-output-escaping="yes" />
</xsl:template>


<xsl:template match="xhtml:HTML/xhtml:HEAD">
  <xsl:copy-of select="." disable-output-escaping="yes"/>
</xsl:template>

</xsl:stylesheet>

As input I use the source code of www.wordpress.org/about (which validates). First the neko purifier is run (HTML -> XHTML), and then my XSLT transformation. When I look at the output code, everything looks similar:

Original code: codepad.org/5D7MCXSk
Code after transformation: http://codepad.org/fGzyAwF2

However, when I open the result in a web browser I get a "white wall": nothing appears. I noticed that in the source view of the transformed page (in both Chrome and Firefox) the syntax is highlighted only up to the closing HEAD tag. That is very odd, and I think it is what causes the problem.

Any help would be greatly appreciated. Thanks in advance.

  • Well, it is not clear what you want to achieve. The root element of your stylesheet has xmlns="http://www.w3.org/1999/xhtml", which suggests you want to output XHTML elements. Your xsl:output also suggests you want to output an XHTML document. However, XHTML is case-sensitive and all its elements and attributes are defined to be lower case, so I don't understand why you have a literal result element named HTML. Using lower-case element and attribute names for any result elements is a first step towards a meaningful XHTML result document. Commented Jan 31, 2011 at 16:37
  • (Second comment, as the first got too long.) If the input is XHTML and you want to match XHTML elements in your patterns, then you need lower-case names there as well, e.g. match="xhtml:html/xhtml:head". If you still have problems, tell us two things: first, whether you serve the transformation result as text/html or with an XML MIME type like application/xml or application/xhtml+xml, and second, what result document you want to create from your input. Commented Jan 31, 2011 at 16:51
  • Are you performing the transformation client side or server side? What are your Content-Type headers? Commented Jan 31, 2011 at 16:51
  • I am sorry, maybe the question was not 100% clear. What I am trying to achieve is to extract some parts from the input XHTML document (let's say the div with id=main and the div with id=bottom), along with all their sub-content, and display them in the output XHTML document, all using an XSL transformation. It transforms one XHTML document into another. But I got stuck at the very beginning: I could not move HEAD and BODY separately, and that is the first point. Extracting other parts is the second. Thanks! Commented Jan 31, 2011 at 18:43

1 Answer

So it seems that http://codepad.org/5D7MCXSk (code 1) is the same as the source code of http://wordpress.org/about/ (code 2), and you process this code with "neko purifier" (is it this one: http://nekohtml.sourceforge.net/ ?), resulting in the document at http://codepad.org/fGzyAwF2 (code 3). Correct me if I'm wrong.

The reason code 3 doesn't show anything in the browser seems to be a self-closing <SCRIPT/> at the end of the <HEAD>. YMMV, but in my tests the browsers didn't like it. When a page is parsed as text/html, the trailing slash on a script tag is ignored, so <SCRIPT/> is treated as an opening tag and everything after it is read as script content. That would also explain why the syntax highlighting stops right there.

Your XSLT code is slightly flawed, but if you feed it code 3 as input, it does produce output. The quirk of the input file, that self-closing script element, is preserved in the transformation.
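One possible workaround, shown here only as a minimal sketch and not as the asker's actual setup, is to give every empty script element a single space as content so the serializer cannot collapse it into <SCRIPT/>. The stylesheet below is an assumption-laden illustration: it copies the whole document unchanged and matches script elements by local name, ignoring case and namespace, because it is not certain which names and namespace the purifier emits.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty script elements (any case, any namespace): keep the attributes
       and add one space of content, so the result is serialized with an
       explicit end tag instead of the minimized <SCRIPT/> form. -->
  <xsl:template match="*[lower-case(local-name()) = 'script'][not(node())]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:text> </xsl:text>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The lower-case() function requires an XSLT 2.0 processor, which matches the version declared in the question. The second template can also be dropped into an existing stylesheet on its own, provided the script elements are processed with xsl:apply-templates rather than copied wholesale with xsl:copy-of.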

Some random notes:

  • The original input (code 1) is well-formed XML, so you don't need to "purify" it
  • <xsl:copy-of> has no disable-output-escaping attribute; that attribute is only allowed on <xsl:value-of> and <xsl:text>
  • There is no point in defining a default namespace for the output document when using method="html", because HTML doesn't use namespaces (unlike XHTML)
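Putting these notes and the comments on the question together, a cleaned-up stylesheet might look roughly like the sketch below. It rests on several assumptions that are not confirmed by the question: that the input elements are in the XHTML namespace with lower-case names (the purifier would have to be configured accordingly), that an XSLT 2.0 processor is used so that method="xhtml" is available, and that the div ids main and bottom from the asker's comment actually exist in the page.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns="http://www.w3.org/1999/xhtml"
  exclude-result-prefixes="xhtml">

  <!-- method="xhtml" (XSLT 2.0) serializes empty non-void elements such as
       script with an explicit end tag instead of the minimized form. -->
  <xsl:output
    method="xhtml"
    omit-xml-declaration="yes"
    doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    indent="yes"/>

  <xsl:template match="/">
    <!-- Lower-case literal result elements, in the XHTML default namespace. -->
    <html>
      <!-- Copy the head as-is; note: no disable-output-escaping here. -->
      <xsl:copy-of select="xhtml:html/xhtml:head"/>
      <body>
        <!-- Hypothetical ids taken from the asker's comment. -->
        <xsl:copy-of
          select="xhtml:html/xhtml:body//xhtml:div[@id = ('main', 'bottom')]"/>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

If one of the selected divs is nested inside the other, it will be copied twice, so the select expression would need tightening for a real page.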

2 Comments

First I run the neko purifier (the same one as in your link), then I run the XSLT transformation. I know that wordpress.org is a valid XHTML site, but the whole mechanism will also have to work on other sites. This one is just a starting point. Thanks
You are right: the problem is the self-closing <SCRIPT /> tag at the end of the HEAD section. Thanks.
