PHP: Fetch content from an HTML page using xpath()

Question

I'm trying to fetch the content of a div in an HTML page using XPath and DOMDocument. This is the structure of the page:

<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div class="div2"></div>
</div>

I want to get only the content of the p elements, not the spans and divs. I came up with the XPath expression .//*[@id='content']/p, but something is not right because I'm getting only the first p. I also tried other expressions using following-sibling and node(), but they all return only the first p.

.//*[@id='content']/span/following-sibling::p
.//*[@id='content']/node()[self::p]

This is how I use XPath:

$domDocument=new DOMDocument();
$domDocument->encoding = 'UTF-8';
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query($this->xpath);
$content = $this->GetHTMLFromDom($domNodeList);

And this is how I get the HTML from the nodes:

private function GetHTMLFromDom($domNodeList){
   $domDocument = new DOMDocument();
   $node = $domNodeList->item(0);
   foreach($node->childNodes as $childNode)
      $domDocument->appendChild($domDocument->importNode($childNode, true));
   return $domDocument->saveHTML();
}

score 2 · Accepted Answer · 2010-10-15 13:06:48Z


This XPath expression:

//div[@id='content']/p

selects the wanted node set (the five p elements).
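A quick way to check this is to run the expression against markup shaped like the page in the question and print the length of the result. This is only a minimal sketch: the $page string and the paragraph texts are made up for illustration, and the variable names simply mirror the ones in the question.

```php
<?php
// Sample markup mirroring the structure in the question (paragraph text is invented).
$page = <<<HTML
<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p>one</p>
<p>two</p>
<p>three</p>
<p>four</p>
<p>five</p>
<div class="div2"></div>
</div>
HTML;

$domDocument = new DOMDocument();
$domDocument->loadHTML($page);

$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query("//div[@id='content']/p");

// length counts every node the expression matched, not just the first one.
echo $domNodeList->length, "\n";
```

If the printed length matches the number of p elements, the expression is fine and the problem is in how the node list is consumed afterwards.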

EDIT: Now it's clear what your problem is. You need to iterate over the whole NodeList instead of taking only item(0) and its child nodes:

private function GetHTMLFromDom($domNodeList){
   $domDocument = new DOMDocument();
   // Import every matched node, not just the first one
   foreach ($domNodeList as $node) {
      $domDocument->appendChild($domDocument->importNode($node, true));
   }
   return $domDocument->saveHTML();
}
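As a possible alternative (not what the code above does), each matched node can be serialized straight from its own document by passing it to saveHTML(), which avoids building a second DOMDocument. This is a sketch under assumptions: the node argument of saveHTML() needs PHP 5.3.6 or later, and the function name getHtmlFromNodeList and the sample $page are hypothetical.

```php
<?php
// Hypothetical helper: concatenates the HTML of every node in the list.
function getHtmlFromNodeList(DOMNodeList $domNodeList)
{
    $html = '';
    foreach ($domNodeList as $node) {
        // saveHTML($node) serializes only that node (PHP >= 5.3.6).
        $html .= $node->ownerDocument->saveHTML($node);
    }
    return $html;
}

// Made-up sample input, reduced from the structure in the question.
$page = '<div id="content"><div class="div1"></div><p>one</p><p>two</p><div class="div2"></div></div>';

$domDocument = new DOMDocument();
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);

$html = getHtmlFromNodeList($domXPath->query("//div[@id='content']/p"));
echo $html, "\n";
```

Either way the point is the same: loop over the whole node list rather than reading only item(0).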

edited Oct 15, 2010 at 13:06

answered Oct 14, 2010 at 18:30

user357812


7 Comments

Luciano Over a year ago

@Alejandro: thanks for the answer, but //div[@id='content']/p doesn't work for me; I get only the first p.

user357812 Over a year ago

@Luciano: Then the problem lies somewhere else in your code. After the query, try this: echo $domNodeList->length

Luciano Over a year ago

@Alejandro: the number of nodes is right, but I still get only the first p. Could it be an error caused by the tidy() function? I get the content of the page with curl, then parse it with $tidy->parseString($curl_res); $tidy->cleanRepair(); return $tidy; Finally I pass this value as $page to DOMDocument.

Luciano Over a year ago

@Alejandro: I've tried leaving out tidy() and passing the content I get from curl straight to DOMDocument, but the result seems the same... Is this the right way to use DOMDocument? (I've updated my question...)

user357812 Over a year ago

@Luciano: Now, with the rest of your code, it's clear what your problem is. Check my edit.
