I'm trying to fetch the content of a div in an HTML page using XPath and DOMDocument. This is the structure of the page:

<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div class="div2"></div>
</div>

I want to get only the content of the p elements, not the spans and divs. I came across the XPath expression .//*[@id='content']/p, but something's not right because I'm getting only the first p. I tried other expressions using following-sibling and node(), but they all return only the first p.

.//*[@id='content']/span/following-sibling::p
.//*[@id='content']/node()[self::p]

This is how I use XPath:

$domDocument = new DOMDocument();
$domDocument->encoding = 'UTF-8';
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query($this->xpath);
$content = $this->GetHTMLFromDom($domNodeList);

And this is how I get the HTML from the nodes:

private function GetHTMLFromDom($domNodeList){
    $domDocument = new DOMDocument();
    $node = $domNodeList->item(0);
    foreach($node->childNodes as $childNode)
        $domDocument->appendChild($domDocument->importNode($childNode, true));
    return $domDocument->saveHTML();
}

1 Answer

This XPath expression:

//div[@id='content']/p

results in the wanted node set (five p elements).
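
A quick way to confirm this is to run the query against markup shaped like the question's and print the match count. The following is only a minimal sketch: the `$page` string and the placeholder text inside each p are assumptions made for illustration, not the asker's real page.

```php
<?php
// Assumed sample markup mirroring the structure shown in the question.
$page = <<<HTML
<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p>one</p>
<p>two</p>
<p>three</p>
<p>four</p>
<p>five</p>
<div class="div2"></div>
</div>
HTML;

$domDocument = new DOMDocument();
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);

// Select only the p elements that are direct children of div#content.
$domNodeList = $domXPath->query("//div[@id='content']/p");

// Number of matched nodes.
echo $domNodeList->length, "\n";

// Visit every match, not just item(0).
foreach ($domNodeList as $node) {
    echo $node->textContent, "\n";
}
```

If the length printed here matches the number of p elements, the expression is fine and any missing nodes are being lost later, when the result is turned back into HTML.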

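EDIT: Now it's clear what your problem is. GetHTMLFromDom takes only item(0), the first p, and copies its child nodes, so the other matches are never visited. You need to iterate over the NodeList: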

private function GetHTMLFromDom($domNodeList){
   $domDocument = new DOMDocument();
   // Import every matched node, not just the first one.
   foreach ($domNodeList as $node) {
      $domDocument->appendChild($domDocument->importNode($node, true));
   }
   return $domDocument->saveHTML();
}
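
As a possible alternative, not part of the original answer, each matched node can be serialized directly by passing it to saveHTML() (supported since PHP 5.3.6), which avoids building a second document. This is a sketch only: the function name `getHtmlFromNodeList` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: serializes every node in the list, not just item(0).
function getHtmlFromNodeList(DOMNodeList $domNodeList): string
{
    $html = '';
    foreach ($domNodeList as $node) {
        // saveHTML($node) dumps a single node via its owner document.
        $html .= $node->ownerDocument->saveHTML($node);
    }
    return $html;
}

// Assumed sample input, shortened from the structure in the question.
$domDocument = new DOMDocument();
$domDocument->loadHTML(
    '<div id="content"><span class="span1"></span><p>a</p><p>b</p><div class="div2"></div></div>'
);
$domXPath = new DOMXPath($domDocument);

$content = getHtmlFromNodeList($domXPath->query("//div[@id='content']/p"));
echo $content;
```

Unlike the original function, this keeps the p tags themselves in the output; to get only their inner content, loop over each node's childNodes inside the foreach instead.
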

7 Comments

@Alejandro: thanks for the answer, but //div[@id='content']/p doesn't work for me; I get only the first p.
@Luciano: Then the problem lies somewhere else in your code. Try this after the query: echo $domNodeList->length
@Alejandro: the number of nodes is right, but I still get only the first p. Could it be an error caused by the tidy() function? I get the content of the page with curl, then parse it with $tidy->parseString($curl_res); $tidy->cleanRepair(); return $tidy; Finally I pass this value as $page to DOMDocument.
@Alejandro: I've tried excluding tidy() and passing DOMDocument the content I get with curl, but it seems to be the same thing... Is this the right way to use DOMDocument? (I've updated my question...)
@Luciano: Now, with your remaining code, it's clear what your problem is. Check my edit.