xpath: extract data from a node using xpath

Question

I want to extract only the sales rank (which in this case is 5) from this line:

Amazon Best Sellers Rank: #5 in Books ( See Top 100 in Books )

The line comes from this web page: http://www.amazon.com/Mockingjay-Hunger-Games-Book-3/dp/0439023513/ref=tmm_hrd_title_0

So far I have gotten down to this, which selects "Amazon Best Sellers Rank:":

//li[@id='SalesRank']/b/text()

I am using PHP DOMDocument and DOMXPath.

This is what I have so far: //li[@id='SalesRank']/b/text()
– Abhi, Commented Jan 19, 2012 at 8:46

Francis Avila · Accepted Answer · 2012-01-21 03:54:24Z

2

You can use pure XPath:

substring-before(normalize-space(/html/body//ul/li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")

However, if your input is a bit messy you might get more reliable results by using XPath to grab the parent node's text, and then using a regex on the text to get the specific thing you want.

Demonstration of both methods using PHP with DOMDocument and DOMXPath:

// Method 1: XPath only
$xp_salesrank = 'substring-before(normalize-space(/html/body//li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")';

// Method 2: XPath and Regex
$regex_ranktext = 'string(/html/body//li[@id="SalesRank"])';
// [\d,]+ rather than \d+ so that ranks with thousands separators (e.g. "#1,234") also match.
$regex_salesrank = '/Best\s+Sellers\s+Rank:\s*(#[\d,]+)\s+/ui';

// Test URLs
$urls = array(
    'http://rads.stackoverflow.com/amzn/click/0439023513',
    'http://www.amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/ref=tmm_kin_title_0?ie=UTF8&m=AG56TWVU5XWC2',
);

// Results
$ranks = array();
$ranks_regex = array();

foreach ($urls as $url) {
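    $d = new DOMDocument();
    // The page markup is not well-formed; buffer libxml's parse warnings instead of printing them.
    libxml_use_internal_errors(true);
    $d->loadHTMLFile($url);
    libxml_clear_errors();
    $xp = new DOMXPath($d);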

    // Method 1: use pure xpath
    $ranks[] = $xp->evaluate($xp_salesrank);

    // Method 2: use xpath to get a section of text, then regex for more specific item
    // This method is probably more forgiving of bad HTML.
    $rank_regex = '';
    $ranktext = $xp->evaluate($regex_ranktext);
    if ($ranktext) {
        if (preg_match($regex_salesrank, $ranktext, $matches)) {
            $rank_regex = $matches[1];
        }
    }
    $ranks_regex[] = $rank_regex;

}

assert($ranks===$ranks_regex); // Both methods should be the same.
var_dump($ranks);
var_dump($ranks_regex);

The output I get is:

array(2) {
  [0]=>
  string(2) "#4"
  [1]=>
  string(2) "#3"
}
array(2) {
  [0]=>
  string(2) "#4"
  [1]=>
  string(2) "#3"
}
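
To try both methods without fetching a live page, here is a minimal sketch that runs them against an inline fragment. The fragment is an assumption, modelled on the markup the asker quotes in the comments (a `<b>` label followed by a bare text node); the real Amazon page may differ. It calls `evaluate()` rather than `query()` because both expressions return a string, not a node set.

```php
<?php
// Hypothetical fragment modelled on the markup quoted in this thread; not the live page.
$html = '<html><body><ul><li id="SalesRank">'
      . '<b>Amazon Best Sellers Rank:</b> #3 Paid in Kindle Store '
      . '(<a href="#">See Top 100 Paid in Kindle Store</a>)'
      . '</li></ul></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Method 1: XPath only. The rank is in the text node that directly follows <b>.
$rank_xpath = $xp->evaluate(
    'substring-before(normalize-space(//li[@id="SalesRank"]/b[1]/following-sibling::text()[1]), " ")'
);

// Method 2: take the string value of the whole <li>, then apply a regex.
$text = $xp->evaluate('string(//li[@id="SalesRank"])');
$rank_regex = preg_match('/Best\s+Sellers\s+Rank:\s*(#[\d,]+)/ui', $text, $m) ? $m[1] : '';

var_dump($rank_xpath, $rank_regex);
```

On this fragment both variables hold "#3". If the two methods disagree on a real page, the regex result is probably the safer one to trust, since it does not depend on where the parser places the text node.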

edited Jan 21, 2012 at 3:54

answered Jan 19, 2012 at 7:51

Francis Avila


6 Comments

Abhi Over a year ago

Thanks, Francis. However, it throws the following error: PHP Warning: DOMXPath::query() [domxpath.query]: Invalid expression

Francis Avila Over a year ago

Check your copy-pasting, because it clearly works. See new code.

Abhi Over a year ago

Tried using the exact code but it fetches null for this page : amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/… (when it should actually fetch '3')

Francis Avila Over a year ago

If you remove ul/ it works on this page too. Amazon's HTML is pretty bad, so it looks like different pages produce inconsistent DOMs in libxml2's HTML parser. Since your host language is PHP, it might be better to get the text value of the parent node and then extract the actual number with a regex instead of doing it all in XPath.

Abhi Over a year ago

Removing ul/ did not help :( As for the regex route: I tried substr($homepage, strpos($homepage, 'Paid in Kindle Store'),-10); but it brings back the complete HTML.


Dimitre Novatchev · Accepted Answer · 2012-01-19 14:23:19Z

0

Use:

substring-before(substring-after($expr, '#'), ' ')

where $expr should be substituted by your expression:

   substring-before(substring-after(//li[@id='SalesRank']/b, '#'), ' ')

Or, if the right expression that selects the text node is (as per @FrancisAvila):

/html/body//ul/li[@id="SalesRank"]/b[1]/following-sibling::text()[1]

then the above becomes:

substring-before(
   substring-after(/html/body//ul/li[@id="SalesRank"]
                  /b[1]/following-sibling::text()[1], '#'), 
   ' ')
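
As a quick check of the nested `substring-after` / `substring-before` form, the sketch below evaluates it on an inline fragment and returns the bare number the question asks for. The function name `sales_rank` and the sample fragment (including the "#1,234" value) are assumptions for illustration only. The expression omits the `ul/` step, which a comment on the other answer reports as not matching on every page, and it adds `translate()` to drop thousands separators before the cast.

```php
<?php
// Hypothetical helper: returns the rank as an integer, or null if it cannot be found.
function sales_rank(string $html): ?int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xp = new DOMXPath($doc);

    // Text after '#', up to the next space, with any ',' removed.
    $rank = $xp->evaluate(
        'translate(substring-before(substring-after('
        . 'normalize-space(//li[@id="SalesRank"]/b[1]/following-sibling::text()[1]), "#"), " "), ",", "")'
    );

    // An empty or non-numeric string means the expression did not find a rank.
    return ctype_digit($rank) ? (int) $rank : null;
}

// Made-up sample input, not taken from a real page.
$html = '<ul><li id="SalesRank"><b>Amazon Best Sellers Rank:</b> #1,234 in Books (...)</li></ul>';
var_dump(sales_rank($html));
```

One limitation of this form: `substring-before(..., ' ')` yields an empty string when nothing follows the rank, so the helper returns null in that case rather than a number.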

edited Jan 19, 2012 at 14:23

answered Jan 19, 2012 at 13:52

Dimitre Novatchev


5 Comments

Francis Avila Over a year ago

The text he desires is not a child of //li[@id='SalesRank']/b but a following-sibling.

Dimitre Novatchev Over a year ago

@FrancisAvila: But he said otherwise... Never mind, he just needs to substitute the right expression for $expr.

Abhi Over a year ago

It does not work either way; it fetches null. I tried it on this page: amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/…

Dimitre Novatchev Over a year ago

@Abhi: If so, then you have misled the readers that //li[@id='SalesRank']/b selects the element, from whose string value you want to extract data. You must provide an example XML, so that any XPath expression could be verified.

Abhi Over a year ago

Apologies if I have confused you. I don't have an XML document; I am trying to extract the sales rank of a book from Amazon's site: <b>Amazon Best Sellers Rank:</b> #3 Paid in Kindle Store ( <a href="amazon.com/gp/bestsellers/digital-text/… Top 100 Paid in Kindle Store</a> The issue is that the value I want to extract is not part of any node; it sits outside of the nodes. Check the Sales Rank section on this page: amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/…
