1

I want to extract only the sales rank (5 in this case) from the following line:

Amazon Best Sellers Rank: #5 in Books ( See Top 100 in Books )

The line comes from this web page: http://www.amazon.com/Mockingjay-Hunger-Games-Book-3/dp/0439023513/ref=tmm_hrd_title_0

So far I have the following expression, which selects only the label "Amazon Best Sellers Rank:" and not the number after it:

//li[@id='SalesRank']/b/text()

I am using PHP DOMDocument and DOMXPath.

2
  • Please provide what you have tried so far. Commented Jan 19, 2012 at 8:31
  • This is what I have so far: //li[@id='SalesRank']/b/text() Commented Jan 19, 2012 at 8:46

2 Answers

2

You can use pure XPath:

substring-before(normalize-space(/html/body//ul/li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")

However, if your input is a bit messy you might get more reliable results by using XPath to grab the parent node's text, and then using a regex on the text to get the specific thing you want.

Demonstration of both methods using PHP with DOMDocument and DOMXPath:

// Method 1: XPath only
$xp_salesrank = 'substring-before(normalize-space(/html/body//li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")';

// Method 2: XPath and Regex
$regex_ranktext = 'string(/html/body//li[@id="SalesRank"])';
// Allow thousands separators (e.g. "#1,234") so both methods agree on ranks above 999.
$regex_salesrank = '/Best\s+Sellers\s+Rank:\s*(#[\d,]+)\s+/ui';

// Test URLs
$urls = array(
    'http://rads.stackoverflow.com/amzn/click/0439023513',
    'http://www.amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/ref=tmm_kin_title_0?ie=UTF8&m=AG56TWVU5XWC2',
);

// Results
$ranks = array();
$ranks_regex = array();

foreach ($urls as $url) {
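    $d = new DOMDocument();
    // Amazon's markup is not well-formed; keep libxml's parse warnings out of the output.
    libxml_use_internal_errors(true);
    $d->loadHTMLFile($url);
    libxml_clear_errors();
    $xp = new DOMXPath($d);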

    // Method 1: use pure xpath
    $ranks[] = $xp->evaluate($xp_salesrank);

    // Method 2: use xpath to get a section of text, then regex for more specific item
    // This method is probably more forgiving of bad HTML.
    $rank_regex = '';
    $ranktext = $xp->evaluate($regex_ranktext);
    if ($ranktext) {
        if (preg_match($regex_salesrank, $ranktext, $matches)) {
            $rank_regex = $matches[1];
        }
    }
    $ranks_regex[] = $rank_regex;

}

assert($ranks===$ranks_regex); // Both methods should be the same.
var_dump($ranks);
var_dump($ranks_regex);

The output I get is:

array(2) {
  [0]=>
  string(2) "#4"
  [1]=>
  string(2) "#3"
}
array(2) {
  [0]=>
  string(2) "#4"
  [1]=>
  string(2) "#3"
}
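
Because the live pages change over time, both methods can also be checked offline. The following is a minimal sketch that assumes a hand-written fragment modeled on the markup quoted in this thread (the `$html` string and the rank `#1,234` are made up for illustration, not taken from Amazon). It drops the `/html/body//ul/` prefix, accepts thousands separators in the regex, and converts the result to an integer.

```php
<?php
// Hypothetical fragment modeled on the markup quoted in this thread; real pages differ.
$html = '<html><body><ul><li id="SalesRank">'
      . '<b>Amazon Best Sellers Rank:</b> #1,234 Paid in Kindle Store '
      . '(<a href="#">See Top 100 Paid in Kindle Store</a>)'
      . '</li></ul></body></html>';

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Method 1: pure XPath. Take the text node after <b>, trim it, keep the first word.
$rank_xpath = $xp->evaluate(
    'substring-before(normalize-space('
    . '//li[@id="SalesRank"]/b[1]/following-sibling::text()[1]), " ")'
);

// Method 2: XPath for the whole <li> text, then a regex for the rank itself.
$text = $xp->evaluate('string(//li[@id="SalesRank"])');
$rank_regex = preg_match('/Best\s+Sellers\s+Rank:\s*(#[\d,]+)/ui', $text, $m) ? $m[1] : '';

// Strip "#" and "," to get a plain integer.
$rank = (int) str_replace(array('#', ','), '', $rank_xpath);

var_dump($rank_xpath, $rank_regex, $rank);
```

If the two methods disagree on a real page, dumping `$text` is usually the quickest way to see what the parser actually produced for that list item.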

6 Comments

Thanks, Francis. However, it throws the following error: PHP Warning: DOMXPath::query(): Invalid expression
Check your copy-pasting, because it clearly works. See new code.
I tried the exact code, but it returns null for this page: amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/… (when it should return '3').
If you remove ul/ it works on this page too. Amazon's HTML is pretty bad, so it looks like different pages produce inconsistent DOMs in libxml2's HTML parser. Since your host language is PHP, it might be better to get the text value of the parent node and then extract the actual number with a regex instead of doing it all in XPath.
Removing ul/ did not help :( As for the regex route, I tried substr($homepage, strpos($homepage, 'Paid in Kindle Store'),-10); but it returns the complete HTML.
0

Use:

substring-before(substring-after($expr, '#'), ' ')

where $expr should be substituted by your expression:

   substring-before(substring-after(//li[@id='SalesRank']/b, '#'), ' ')

Or, if the right expression that selects the text node is (as per @FrancisAvila):

/html/body//ul/li[@id="SalesRank"]/b[1]/following-sibling::text()[1]

then the above becomes:

substring-before(
   substring-after(/html/body//ul/li[@id="SalesRank"]
                  /b[1]/following-sibling::text()[1], '#'), 
   ' ')
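
As a rough illustration of the substitution described above, here is a minimal PHP sketch. It assumes a hand-written fragment based on the markup quoted in the comments (the `$html` string is hypothetical, not the live page) and uses `DOMXPath::evaluate()`, which can return a string, rather than `query()`, which is meant for node sets.

```php
<?php
// Hypothetical fragment based on the markup quoted in the comments; not the live page.
$html = '<ul><li id="SalesRank"><b>Amazon Best Sellers Rank:</b> #3 Paid in Kindle Store '
      . '(<a href="#">See Top 100 Paid in Kindle Store</a>)</li></ul>';

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// The rank sits in the text node that follows <b>, so select that node, not <b> itself.
$expr = '//li[@id="SalesRank"]/b[1]/following-sibling::text()[1]';

// Keep what comes after '#', then cut at the first space.
$rank = $xpath->evaluate("substring-before(substring-after($expr, '#'), ' ')");

// For comparison: using <b> as the expression gives an empty string,
// because the label text contains no '#'.
$from_b = $xpath->evaluate(
    "substring-before(substring-after(//li[@id='SalesRank']/b, '#'), ' ')"
);

var_dump($rank, $from_b);
```

The rank is an ordinary text node that is a child of the `li` element and a sibling of `b`, which is why the `following-sibling::text()` step can reach it.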

5 Comments

The text he desires is not a child of //li[@id='SalesRank']/b but a following-sibling.
@FrancisAvila: But he said otherwise... Never mind, he just needs to substitute the right expression for $expr.
It does not work either way; it returns null. I tried it on this page: amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/…
@Abhi: If so, then you have misled the readers that //li[@id='SalesRank']/b selects the element, from whose string value you want to extract data. You must provide an example XML, so that any XPath expression could be verified.
Apologies if I have confused you. I don't have an XML document; I am trying to extract the sales rank of a book from Amazon's site: <b>Amazon Best Sellers Rank:</b> #3 Paid in Kindle Store ( <a href="amazon.com/gp/bestsellers/digital-text/… Top 100 Paid in Kindle Store</a> The issue is that the value I want to extract is not inside any element; it sits outside the tags. Check the Sales Rank section on this page: amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/…
