Can I parse an HTML using XSLT?

Question

I have to parse a big HTML file, and I'm only interested in a small section (a table). So I thought about using XSLT to simplify/transform the HTML into something simpler that I could then easily process.

The problem I'm having is that the stylesheet is not finding my table. So I don't know if it's even possible to parse HTML using an XSL stylesheet.

By the way, the HTML file looks like this (schematic, missing tags):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html id="ctl00_htmlDocumento" xmlns="http://www.w3.org/1999/xhtml" lang="es-ES" xml:lang="es-ES">
<div> some content </div>
<div class="NON_IMPORTANT"></div>
<div class="IMPORTANT_FATHER">
    <div class="IMPORTANT">
        <table>
            HERE IS THE DATA IM LOOKING FOR
        </table>
    </div>
</div>

As requested, here is my XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="tbody">
        tbody found, lets process it
    <xsl:for-each select="tr">
        new tf found, lets process it
    </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>

The full HTML is quite big, so I don't know how to present it here. I've validated the document in Oxygen, and it says it's valid.

Thanks in advance. Gonso

XSLT is used to perform transformations on the input document and not parsing. Also without showing your HTML and XSLT documents you can't expect to get a helpful answer.
– Darin Dimitrov, Commented Oct 28, 2009 at 19:46
You might want to show how you are trying to use the stylesheet and a snippet of the stylesheet that is failing.
– James Black, Commented Oct 28, 2009 at 19:47
You can do this, but I think you will have trouble. You should use an HTML parser in your language that supports sloppy HTML.
– Byron Whitlock, Commented Oct 28, 2009 at 19:51
Since your document is XHTML, XSLT should work on it, so there's probably something wrong with your stylesheet. Without seeing the actual stylesheet trying to handle the table, and probably also the HTML structure leading to the table, it's impossible to say more.
– JaakkoK, Commented Oct 28, 2009 at 19:57

JaakkoK · Accepted Answer · 2009-10-29 07:21:16Z

5

You're not using XPath correctly in your match attributes. You need the xmlns:xhtml="http://www.w3.org/1999/xhtml" attribute in your xsl:stylesheet element, and then you'll need to use the xhtml: prefix in your XPath expressions (you need a prefix; XPath does not obey default namespaces).
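As a minimal sketch, this is the stylesheet from the question with only the namespace fix applied. The prefix name xhtml is an arbitrary choice; only the namespace URI has to match the one on the html element. It also assumes the table really contains tbody and tr elements, as the original stylesheet expects.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- match="tbody" would look for a tbody in no namespace and never fire;
       the prefixed name matches tbody in the XHTML namespace. -->
  <xsl:template match="xhtml:tbody">
    tbody found, lets process it
    <!-- The same rule applies to select: tr needs the prefix too. -->
    <xsl:for-each select="xhtml:tr">
      new tr found, lets process it
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```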

After this, you'll still get the problem that it will process everything else too. I don't know if there's a better solution to this, but I think you will need to explicitly process things on the path to the tbody element, something like

<xsl:template match="xhtml:html">
  <xsl:apply-templates select="xhtml:body"/>
</xsl:template>

and the same thing for body and so on until you get to your tbody match.
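Spelled out for the schematic document in the question, the whole chain might look like the sketch below. It assumes a body element between html and the div elements (the schematic omits it), that the IMPORTANT_FATHER div is a direct child of body, and that the table contains tbody and tr elements. The class tests in square brackets are predicates, which are explained further down. The text output format is only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Each template forwards processing only to the next element on the
       path, so siblings off the path are never visited. -->
  <xsl:template match="xhtml:html">
    <xsl:apply-templates select="xhtml:body"/>
  </xsl:template>

  <xsl:template match="xhtml:body">
    <xsl:apply-templates select="xhtml:div[@class='IMPORTANT_FATHER']"/>
  </xsl:template>

  <xsl:template match="xhtml:div[@class='IMPORTANT_FATHER']">
    <xsl:apply-templates select="xhtml:div[@class='IMPORTANT']"/>
  </xsl:template>

  <xsl:template match="xhtml:div[@class='IMPORTANT']">
    <xsl:apply-templates select="xhtml:table/xhtml:tbody"/>
  </xsl:template>

  <!-- End of the chain: emit one line of text per table row. -->
  <xsl:template match="xhtml:tbody">
    <xsl:for-each select="xhtml:tr">
      <xsl:text>row </xsl:text>
      <xsl:value-of select="position()"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```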

XPath also supports more complex matching than just a specific child as above. For instance, matching the third child div tag can be done with

<xsl:template match="xhtml:div[3]">

and matching an element with a specific attribute with

<xsl:template match="xhtml:div[@class='IMPORTANT']">

Here the [] encloses an additional condition that the element must satisfy to be considered a match. A plain number is a 1-based position: xhtml:div[3] matches a div that is the third div child of its parent. An @ sign introduces an attribute name. You can put arbitrarily complex XPath inside the brackets, so you can match pretty much any substructure you'd like.
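The same predicates work in select expressions as well as in match patterns. The sketch below uses them from a single template on the document root, which is an arrangement chosen here for brevity rather than something the question requires. It assumes the div elements are children of a body element and that the rows are tr elements somewhere inside the table.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Positional predicate: the third div child of body (1-based). -->
    <xsl:text>third div class: </xsl:text>
    <xsl:value-of select="xhtml:html/xhtml:body/xhtml:div[3]/@class"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Attribute predicate followed by a structural one: any div whose
         class is IMPORTANT and which has a table child. -->
    <xsl:for-each select="//xhtml:div[@class='IMPORTANT'][xhtml:table]">
      <xsl:text>rows in table: </xsl:text>
      <xsl:value-of select="count(xhtml:table//xhtml:tr)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the root template does not call xsl:apply-templates, nothing else in the document is processed, so the built-in rules never get a chance to copy unrelated text to the output.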

edited Oct 29, 2009 at 7:21

answered Oct 28, 2009 at 20:49

JaakkoK


5 Comments

gonso Over a year ago

Thanks, that helped. Is there no way to search for an inner tag directly? either by just stating its name or full path?

JaakkoK Over a year ago

You can match it directly, but then you'll get the default behavior for everything else, which is to print their text content. And if you try to use <xsl:template match="*"/> to eliminate the default behavior, that's going to affect your root element too, so it won't process the document at all. So if you want processing on only a single element in the document, I think you do need to override the default behavior on all elements along the path to that element.

gonso Over a year ago

Almost there. I've added the HTML schema. My question is: how do I dive into the right "div" tag? There are several, but I'm only interested in some of them. Can I filter by occurrence (i.e. the third div tag) or by an attribute (i.e. the div tag where class="IMPORTANT")? Thanks again.

Pavel Minaev Over a year ago

@jk, an easier approach is to do <xsl:template match="*|text()"><xsl:apply-templates/></xsl:template> - this removes all output for default rule, but ensures that the entire tree is processed (so other rules may still match).

gonso Over a year ago

@Pavel, Thanks! That really simplified my XSL. May I ask how/why it works? How do you read the "*|text()" condition? Any tag or the text inside any tag? If that is the case, how come this rule doesn't take precedence over the other tag-specific rules I've created? Like this:

<xsl:template match="*|text()">
  <xsl:apply-templates/>
</xsl:template>
<xsl:template match="xhtml:tbody">
  process tbody
</xsl:template>
</xsl:stylesheet>

Thanks!
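The thread does not answer that last question, so here is a hedged sketch of the catch-all approach as a complete stylesheet, with the relevant XSLT 1.0 default priorities noted in comments. The pattern "*|text()" is read as two alternatives, "any element" or "any text node", and each alternative has default priority -0.5. A pattern that names an element, such as xhtml:tbody, has default priority 0, so it is preferred whenever both match the same node. The per-row text output is an assumption made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <xsl:output method="text"/>

  <!-- Catch-all: visit every element and text node but emit nothing.
       Both alternatives have default priority -0.5. -->
  <xsl:template match="*|text()">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- A named element pattern has default priority 0, so this template
       wins over the catch-all for tbody elements. -->
  <xsl:template match="xhtml:tbody">
    <xsl:for-each select="xhtml:tr">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The tbody template reads its rows with xsl:value-of rather than xsl:apply-templates, so the catch-all's suppression of text nodes does not blank out the row content.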

Christian Hayter · Accepted Answer · 2009-10-28 19:57:47Z

As long as your XHTML document is well-formed, an XML parser will be able to read it, and therefore an XSLT engine will be able to transform it.

Assuming that, the most common causes of not being able to find elements in a document are:

1. Your XPath expression is being evaluated relative to a different node than you thought it would be. What this means for your XSLT: check that your match patterns are correct relative to their templates.
2. You have not defined the namespace URI-to-prefix mappings in your XPath engine. What this means for your XSLT: make sure the http://www.w3.org/1999/xhtml namespace is declared in your XSLT file with a prefix, for example xmlns:xhtml="http://www.w3.org/1999/xhtml", and use that prefix in your patterns. In XSLT 1.0 a default namespace declaration without a prefix does not apply to XPath expressions.

If you post your XSLT I will be able to comment further.

Welbog · Accepted Answer · 2009-10-28 19:48:41Z

3

You can use XSLT to manipulate HTML assuming the HTML is well formatted (as in the HTML document is a well-formed XML document in the strictest sense).

If you can confirm this, and your XSLT isn't working, maybe you should provide a more thorough snippet of both the HTML and XSLT documents so that we can figure it out.

answered Oct 28, 2009 at 19:48

Welbog
