Correct syntax of XPath function 'substring-after' for html that selects only substring of all nodes?

Question

I need an XPath that selects only a substring of each node. I have been using this XPath, but it selects all of the text instead of the substring.

//span[@class="feed-date"]/text()[substring-after(., "on ")]

The HTML I have is below; I want to extract only the date after 'Published on'.

<span class="feed-date">Published on 2016-07-07</span>
<span class="feed-date">Published on 2015-02-23</span>
<span class="feed-date">Published on 2014-11-13</span>
<span class="feed-date">Published on 2014-04-28</span>

I found this link that says you can do it in XML.

But I can't do it with HTML. Is there any way to achieve this?

Martin Honnen · Accepted Answer · 2016-08-24 10:24:14Z


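In XPath 2.0 and later (and likewise XQuery 1.0 and later, or XSLT 2.0 and later) you can use //span[@class = 'feed-date']/substring-after(., 'on ') to get a sequence of string values. With XPath 1.0 that functionality does not exist: a function call cannot be a step in a path, and the predicate in text()[substring-after(., "on ")] only filters nodes (any non-empty result counts as true), so the whole text node is still returned. You would need to iterate over all your span elements in a host language and extract the string for each span.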
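The XPath 1.0 approach of iterating in a host language might look roughly like the following C# sketch. It assumes only the System.Xml.XPath classes that ship with .NET; the FeedDates class, the Extract helper, and the inline sample markup are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

static class FeedDates
{
    // Hypothetical helper: returns the text after "on " for every matching span.
    public static List<string> Extract(string xml)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        var dates = new List<string>();

        // Step 1 (XPath 1.0): select the span elements themselves.
        XPathNodeIterator spans = nav.Select("//span[@class = 'feed-date']");
        while (spans.MoveNext())
        {
            // Step 2: evaluate substring-after with the current span as the context node.
            dates.Add((string)spans.Current.Evaluate("substring-after(., 'on ')"));
        }
        return dates;
    }

    static void Main()
    {
        // Assumed sample input, shortened from the markup in the question.
        string xml =
            "<body>" +
            "<span class=\"feed-date\">Published on 2016-07-07</span>" +
            "<span class=\"feed-date\">Published on 2015-02-23</span>" +
            "</body>";

        Console.WriteLine(string.Join(" ", Extract(xml)));
    }
}
```

The same loop should also work on a navigator obtained from HTMLAgilityPack's HtmlDocument, since that exposes the same XPathNavigator API, although that variant is not shown here.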

As for using XPath 2.0 with HTMLAgilityPack, it looks as if that is possible by making use of https://github.com/StefH/XPath2.Net, which is also available on NuGet. With it, the Microsoft XPathNavigator gets various extension methods such as XPath2Evaluate, which allow you to use XPath 2.0 functions on an XPathNavigator created from Microsoft's XPathDocument as well as on one created from HTMLAgilityPack's HtmlDocument.

Here is an example:

using System;
using System.Xml.XPath;
using Wmhelp.XPath2;
using HtmlAgilityPack;

namespace XPath20Net1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Evaluate the XPath 2.0 expression against a well-formed XML file.
            XPathNavigator nav = new XPathDocument("XMLFile1.xml").CreateNavigator();
            Console.WriteLine(nav.XPath2Evaluate("string-join(//span[@class = 'feed-date']/substring-after(., 'on '), ' ')"));

            // Evaluate the same expression against an HTML file parsed by HtmlAgilityPack.
            HtmlDocument doc = new HtmlDocument();
            doc.Load("HTMLPage1.html");

            Console.WriteLine(doc.CreateNavigator().XPath2Evaluate("string-join(//span[@class = 'feed-date']/substring-after(., 'on '), ' ')"));
        }
    }
}

With the XML document being

<?xml version="1.0" encoding="utf-8" ?>
<html>
  <body>
    <span class="feed-date">Published on 2016-07-07</span>
    <span class="feed-date">Published on 2015-02-23</span>
    <span class="feed-date">Published on 2014-11-13</span>
    <span class="feed-date">Published on 2014-04-28</span>
  </body>
</html>

and the HTML document being

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>Test</title>
</head>
<body>
 <p id=test>

         <span class="feed-date">Published on 2016-07-07</span>
         <span class="feed-date">Published on 2015-02-23</span>
         <span class="feed-date">Published on 2014-11-13</span>
         <span class="feed-date">Published on 2014-04-28</span>

</body>
</html>

the output is

2016-07-07 2015-02-23 2014-11-13 2014-04-28
2016-07-07 2015-02-23 2014-11-13 2014-04-28

edited Aug 24, 2016 at 10:24

answered Aug 24, 2016 at 9:38

Martin Honnen


4 Comments

Nithin B Over a year ago

I am using Firebug to select in the browser, and HtmlAgilityPack in code. Do both of them not use XPath 2.0?

Martin Honnen Over a year ago

Browsers have not made any attempt to support XPath 2.0. And HTMLAgilityPack makes use of the Microsoft .NET XPathNavigator infrastructure which also only implements XPath 1.0. So in that context you would need to select all //span[@class = 'feed-date'] elements first and then use substring-after(., 'on ') on each selected span.

Martin Honnen Over a year ago

@NithinB, I have edited my answer considerably to show how to use an XPath 2.0 library available with NuGet together with HTMLAgilityPack to use a single XPath expression string-join(//span[@class = 'feed-date']/substring-after(., 'on '), ' ') to select a string with all date values.

Nithin B Over a year ago

That was very helpful. Thanks for the solution.
