2

I have a large number of html files like the following 01.html file:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>My Title</title> 
  </head>
  <body>
    <item itemprop="itemprop1" content="content1" /> 
    <item itemprop="itemprop2" content="content2" /> 
    <item itemprop="itemprop3" content="content3" /> 
    <item itemprop="itemprop4" content="content4" />
    <item itemprop="itemprop5" content="content5" />
    <item itemprop="itemprop6" content="content6" />
    <item itemprop="itemprop7" content="content7" />
    <item itemprop="itemprop8" content="content8" />
    <item itemprop="itemprop9" content="content9" />
  </body>
</html>

There is only one item node with itemprop="itemprop1" in each html file. Same for itemprop2, itemprop3, etc.

I would like to have the following txt file output:

content1|content5

that is, the concatenation of:

  1. the value of the content attribute of the item with itemprop="itemprop1"
  2. a pipe "|"
  3. the value of the content attribute of the item with itemprop="itemprop5"

I run the following bash script:

xsltproc 01.xslt 01.html >> 02.txt

where 01.xslt is the following:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="body">
  <xsl:value-of select="//item[@itemprop='itemprop1']/@content"/>|<xsl:value-of select="item[@itemprop='itemprop5']/@content"/>
 </xsl:template>

</xsl:stylesheet>

Unfortunately it doesn't work. What is the correct xslt file?

UPDATE

This is the final working example.

01.html is the following:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <title>My Title</title> 
  </head>
  <body>
    <item itemprop="itemprop1" content="content1" /> 
    <item itemprop="itemprop2" content="content2" /> 
    <item itemprop="itemprop3" content="content3" /> 
    <item itemprop="itemprop4" content="content4" />
    <item itemprop="itemprop5" content="content5" />
    <item itemprop="itemprop6" content="content6" />
    <item itemprop="itemprop7" content="content7" />
    <item itemprop="itemprop8" content="content8" />
    <item itemprop="itemprop9" content="content9" />
  </body>
</html>

01.xslt is the following:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes" method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="html">
  <xsl:value-of select="//item[@itemprop='itemprop1']/@content"/>
  <xsl:text>|</xsl:text>
  <xsl:value-of select="//item[@itemprop='itemprop5']/@content"/>
 </xsl:template>

</xsl:stylesheet>

and the output 02.txt is the following:

content1|content5
4
  • What is being output that you do not want? Does it not work because it is outputting the title? Commented Jun 25, 2018 at 18:47
  • Your main problem using xsltproc is that you're trying to process HTML instead of XML. The difference is in the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> tag, which isn't closed, so the input is not well-formed XML for the XSLT processor (which results in an error). Commented Jun 25, 2018 at 19:04
  • @Wyatt Shipman: I do not understand why it is outputting the title after I closed the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> tag. I'm matching "body" and selecting only items, so why is it also outputting the title, which is contained in "head"? Commented Jun 25, 2018 at 19:35
  • PS. Now I understand why the title was output. I should have matched "html", not "body". Commented Jun 25, 2018 at 19:44

3 Answers

3

Actually, XSLT processes XML files, not HTML.

Your source HTML almost meets the requirements of well-formed XML. There is only one error: your meta element is not closed, so I changed it to:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

(adding / before the closing >). Otherwise the XSLT processor displays an error message (at least in my installation).

As far as your XSLT is concerned, I made a few corrections:

  • changed match="body" to match="html", so that head (and its title text) is no longer processed by the built-in templates, which copy text to the output,
  • added // in the second xsl:value-of, since item is not a direct child of html,
  • changed the "bare" | to <xsl:text>|</xsl:text>, purely for readability (it keeps the lines short),
  • set <xsl:output method="text"/>, as your output is plain text rather than XML.

The last two changes are optional; you can ignore them.

So the whole stylesheet could look like this:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="html">
    <xsl:value-of select="//item[@itemprop='itemprop1']/@content"/>
    <xsl:text>|</xsl:text>
    <xsl:value-of select="//item[@itemprop='itemprop5']/@content"/>
  </xsl:template>
</xsl:stylesheet>
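
Since the command in the question appends to 02.txt with >> and there are many input files, each result probably needs to end with a line break; as far as I can tell, method="text" does not add one on its own. Below is a minimal sketch of a variant, assuming every file has the same html/body/item layout as 01.html. It builds the whole line with concat() in a single xsl:value-of and ends it with a newline (&#10;). Matching the document root and using absolute paths is just one possible arrangement, not a required change.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Matching the root means the built-in templates never visit head/title -->
  <xsl:template match="/">
    <!-- concat() joins both attribute values with a pipe and ends the line -->
    <xsl:value-of select="concat(html/body/item[@itemprop='itemprop1']/@content,
                                 '|',
                                 html/body/item[@itemprop='itemprop5']/@content,
                                 '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

With this variant, running xsltproc once per file and appending with >> should leave one content1|content5 style line per input file in 02.txt.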

1 Comment

Thank you Valdi_Bo. I have tried the changes that you suggested and in fact it now works. Thank you also for the 2 optional changes that I didn't know yet. As you are the first person to respond with a correct answer I'm going to select your answer as the correct one, although zx485's answer is also correct.
1

Your main problem using xsltproc is that you're trying to process HTML instead of XML. The difference is in the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> tag, which isn't closed, so the input is not well-formed XML for the XSLT processor (which results in an error). So add a closing slash to make it

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

If you fix this problem and add a template that removes 'non-matching' text() nodes like

<xsl:template match="text()" />

your XSLT will do what you want.
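
For reference, here is a sketch of what the complete stylesheet might look like with this approach, keeping the original match="body" template. The empty template helps because the built-in rules of XSLT 1.0 apply templates to the children of every element that has no matching template and copy text nodes to the output, which is how the title text ends up in the result. The method="text" setting and the use of xsl:text for the pipe are additions here, not part of the stylesheet in the question.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Override the built-in rule that copies text nodes (e.g. the title) to the output -->
  <xsl:template match="text()"/>

  <!-- Reached via the built-in element rule; item is a direct child of body here -->
  <xsl:template match="body">
    <xsl:value-of select="item[@itemprop='itemprop1']/@content"/>
    <xsl:text>|</xsl:text>
    <xsl:value-of select="item[@itemprop='itemprop5']/@content"/>
  </xsl:template>
</xsl:stylesheet>
```

The empty text() template does not interfere with xsl:value-of, because xsl:value-of reads the attribute values directly instead of applying templates.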

2 Comments

Thank you zx485. I was just about to update my question, because I noticed that my example did not have the closing slash in the meta line. My output, however, was "My Titlecontent1|content5". I have added the line to remove text() and now my output is correct: "content1|content5". I still do not understand why text() is output, though. Could you give me a pointer please?
PS I found out why the title was output. I should have matched html, not body.
0
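<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" indent="yes"/>
    <!-- Match the document root and select both items by absolute path -->
    <xsl:template match="/">
        <xsl:value-of select="html/body/item[@itemprop='itemprop1']/@content"/>|<xsl:value-of select="html/body/item[@itemprop='itemprop5']/@content"/>
    </xsl:template>
</xsl:stylesheet>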
