2

I have the following HTML markup:

<div contenteditable="true" class="text"></div>
<div contenteditable="true" class="text"></div>
<div style="display: block;" class="ui-draggable">
    <img class='avatar' src=""/>
    <p style="">
    <img class='pic' src=""/><br>
    <span class='fulltext' style="display:none"></span>
    </p>-<span class='create'></span>
    <a class='permalink' href=""></a>
    </div>
 <div contenteditable="true" class="text"></div>
 <div style="display: block;" class="ui-draggable">
    <img class='avatar' src=""/>
    <p style="">
    <img class='pic' src=""/><br>
    <span class='fulltext' style="display:none"></span>
    </p><span class='create'></span><a class='permalink' href=""></a>
    </div>

There can be more of these parent divs. To parse the information and insert it into the DB, I'm using the following code:

$dom = new DOMDocument();
$dom->loadHTML($xml);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div');
$i=0;
$q=1;
foreach($div as $book) {
    $attr = $book->getAttribute('class');
    //if div contenteditable
    if($attr == 'text') {
        echo '</br>'.$book->nodeValue."</br>";  
    }
    
    else {
        $new = new DOMDocument();
        $newxpath = new DOMXPath($new);
        $avatar = $xpath->query("(//img[@class='avatar']/@src)[$q]");
        
        $picture = $xpath->query("(//p/img[@class='pic']/@src)[$q]");
        $fulltext = $xpath->query("(//p/span[@class='fulltext'])[$q]");
        $permalink = $xpath->query("(//a[@class='permalink'])[$q]");
        echo $permalink->item(0)->nodeValue; //date
        echo $permalink->item(0)->getAttribute('href');
        echo $fulltext->item(0)->nodeValue;
        echo $avatar->item(0)->value;
        echo $picture->item(0)->value;
        $q++;
    }
    $i++;
}

I suspect there is a better way to parse this HTML. Is there? Thank you in advance.

  • $avatar = $avatar; is useless. Commented Feb 22, 2013 at 12:03
  • Yeah, I've missed that. Thanks. Commented Feb 22, 2013 at 12:06

2 Answers

5

Note that DOMXPath::query supports a second parameter, the context node (contextnode in the manual), which makes the expression relative to that node. You also don't need a second DOMDocument and DOMXPath inside the loop. Use:

$avatar = $xpath->query("img[@class='avatar']/@src", $book);

to get the src attribute nodes of the <img> elements relative to the current div node. With those two changes, your example should be fine.


Here is a version of your code that applies the above:

$dom = new DOMDocument();
$dom->loadHTML($xml);

$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div');

foreach($divs as $book) {
    $attr = $book->getAttribute('class');
    if($attr == 'text') {
        echo '<br>'.$book->nodeValue."<br>";
    } else {
        $avatar = $xpath->query("img[@class='avatar']/@src", $book);
        $picture = $xpath->query("p/img[@class='pic']/@src", $book);
        $fulltext = $xpath->query("p/span[@class='fulltext']", $book);
        $permalink = $xpath->query("a[@class='permalink']", $book);
        echo $permalink->item(0)->nodeValue; //date
        echo $permalink->item(0)->getAttribute('href');
        echo $fulltext->item(0)->nodeValue;
        echo $avatar->item(0)->value;
        echo $picture->item(0)->value;
    }
}
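
One caveat with the loop above: item(0) returns null when a query matches nothing, and dereferencing that null is what produces "Trying to get property of non-object". A minimal sketch of a guard follows, assuming some blocks may lack an element. The sample markup, the helper name firstValue and the array keys are hypothetical and not part of the original code.

```php
<?php
// Hypothetical sample markup; the second block deliberately has no <p> and no permalink.
$html = '<div class="ui-draggable"><img class="avatar" src="a.png"/>'
      . '<p><img class="pic" src="p.png"/><span class="fulltext">Hello</span></p>'
      . '<a class="permalink" href="/post/1">Feb 22</a></div>'
      . '<div class="ui-draggable"><img class="avatar" src="b.png"/></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: string value of the first match, or a default when nothing matches.
function firstValue(DOMXPath $xpath, string $expr, DOMNode $context, string $default = ''): string
{
    $nodes = $xpath->query($expr, $context);
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    // nodeValue works for both element and attribute nodes.
    return (string) $nodes->item(0)->nodeValue;
}

$rows = [];
foreach ($xpath->query("//div[@class='ui-draggable']") as $book) {
    $rows[] = [
        'avatar'         => firstValue($xpath, "img[@class='avatar']/@src", $book),
        'picture'        => firstValue($xpath, "p/img[@class='pic']/@src", $book),
        'fulltext'       => firstValue($xpath, "p/span[@class='fulltext']", $book),
        'permalink_href' => firstValue($xpath, "a[@class='permalink']/@href", $book),
        'permalink_text' => firstValue($xpath, "a[@class='permalink']", $book),
    ];
}

print_r($rows);
```

Missing pieces come back as empty strings instead of raising an error, so each row can be inserted or skipped as needed.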

4 Comments

"Trying to get property of non-object" - echo $picture->.., echo $fulltext->..
Can you post the full HTML to pastebin?
Perfect. Thank you very much. One last question: what's the difference between nodeValue, value and textValue?
In the example above you sometimes select DOMElement nodes (read nodeValue) and sometimes DOMAttr nodes (read value). I'm unsure about textValue. I'd expect it to be the value of a DOMText node, or the flattened textual representation of a DOMElement's children.
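
To make the difference concrete, here is a small sketch on hypothetical markup. As far as I know, PHP's DOM extension has no textValue property; the closest match is textContent, which every DOMNode has. The variable names are illustrative only.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p class="note">Hello <b>world</b></p>');
$xpath = new DOMXPath($dom);

$p    = $xpath->query('//p')->item(0);         // DOMElement
$attr = $xpath->query('//p/@class')->item(0);  // DOMAttr
$text = $p->firstChild;                        // DOMText (first child of <p>)

// Element: nodeValue and textContent both give the concatenated text of all descendants.
$elementNodeValue   = $p->nodeValue;
$elementTextContent = $p->textContent;

// Attribute: value is specific to DOMAttr; nodeValue returns the same string.
$attrValue     = $attr->value;
$attrNodeValue = $attr->nodeValue;

// Text node: nodeValue is just that node's own text.
$textNodeValue = $text->nodeValue;

var_dump($elementNodeValue, $elementTextContent, $attrValue, $attrNodeValue, $textNodeValue);
```

So value only exists on attribute nodes, while nodeValue and textContent can be read on any node.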
0

As a matter of fact, you are doing it the right way: HTML has to be parsed with a DOM object. Some optimisation can still be made:

$div = $xpath->query('//div');

is quite greedy; getElementsByTagName should be more appropriate:

$div = $dom->getElementsByTagName('div');
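
Both calls select the same div elements, so swapping one for the other does not change the loop. A minimal sketch to check that on hypothetical sample markup follows; it makes no claim about memory use, and the variable names are illustrative.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<div class="text">a</div>'
    . '<div class="ui-draggable"><div class="text">b</div></div>'
);
$xpath = new DOMXPath($dom);

$byXPath   = $xpath->query('//div');
$byTagName = $dom->getElementsByTagName('div');

// Compare the two lists position by position (both are in document order).
$sameNodes = $byXPath->length === $byTagName->length;
for ($i = 0; $sameNodes && $i < $byXPath->length; $i++) {
    $sameNodes = $byXPath->item($i)->isSameNode($byTagName->item($i));
}

var_dump($byXPath->length, $byTagName->length, $sameNodes);
```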

3 Comments

I have doubts about the use of $q
@artragis Note that both statements return the same value in any case.
getElementsByTagName is buffered, so it uses less memory. Let me find the message on the @internals list and show it to you as evidence.
