1

I have a string that contains text mixed with HTML, stored in a $html variable:

'Here is some <a href="#">text</a> which I do not need to extract but then there are 
<figure class="class-one">
    <img src="/example.jpg" alt="example alt" class="some-image-class">
    <figcaption>example caption</figcaption>
</figure>

And another one (and many more)
<figure class="class-one some-other-class">
    <img src="/example2.jpg" alt="example2 alt">
</figure>'

I want to extract all <figure> elements and everything they contain, including their attributes and child HTML elements, and put the result into a PHP array, so that I get something like:

    $figures = [
        0 => [
            "class" => "class-one",
            "img" => [
                "src" => "/example.jpg",
                "alt" => "example alt",
                "class" => "some-image-class"
            ],
            "figcaption" => "example caption"
        ],
        1 => [
            "class" => "class-one some-other-class",
            "img" => [
                "src" => "/example2.jpg",
                "alt" => "example2 alt",
                "class" => null
            ],
            "figcaption" => null
        ]];

So far I have tried:

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$figures = array();
foreach ($figures as $figure) {
    $figures['class'] = $figure->getAttribute('class');
    // here I tried to create the whole array but I can't seem to get the values from the HTML 
    // also I'm not sure how to get all html-elements within <figure>   
} 

Here is a Demo.

2
  • You are overwriting the $figures variable before the loop. Commented Jun 16, 2019 at 11:35
  • @msg, that's just an example of what I wish to get. Commented Jun 16, 2019 at 12:46

2 Answers

4

Here is the code that should get you where you want to be. I have added comments where I felt they would be helpful:

<?php

$htmlString = 'Here is some <a href="#">text</a> which I do not need to extract but then there are <figure class="class-one"><img src="/example.jpg" alt="example alt" class="some-image-class"><figcaption>example caption</figcaption></figure>And another one (and many more)<figure class="class-one some-other-class"><img src="/example2.jpg" alt="example2 alt"></figure>';

//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML.
@$dom->loadHTML($htmlString);

//Create a new DOMXPath instance so we can query the parsed document
$xp = new DOMXPath($dom);

//Create empty figures array that will hold all of our parsed HTML data
$figures = array();

//Get all <figure> elements
$figureElements = $xp->query('//figure');

//Create number variable to keep track of our $figures array index
$figureCount = 0;

//Loop through each <figure> element
foreach ($figureElements as $figureElement) {
    $figures[$figureCount]["class"] = trim($figureElement->getAttribute('class'));

    //Use a relative path ('.//img'). A path starting with '//' searches the whole document and ignores the context node.
    $img = $xp->query('.//img', $figureElement)->item(0);

    //getAttribute() returns an empty string for a missing attribute, so check hasAttribute() to store null instead.
    foreach (array('src', 'alt', 'class') as $attribute) {
        $figures[$figureCount]["img"][$attribute] = ($img && $img->hasAttribute($attribute)) ? $img->getAttribute($attribute) : null;
    }

    //Check that a <figcaption> element exists inside this figure, otherwise set the value to null
    $figcaption = $xp->query('.//figcaption', $figureElement)->item(0);
    $figures[$figureCount]["figcaption"] = $figcaption ? $figcaption->nodeValue : null;

    //Increment our $figureCount so that we know we can create a new array index.
    $figureCount++;
}

print_r($figures);
?>

2 Comments

Those comments are very helpful indeed. I have just one more question about this solution: is there a benefit to using DOMXpath instead of only using DOMDocument to get all the values?
Glad to help! Yes, I use Xpath so that I can easily target the HTML elements that I’m looking to parse, as well as check if the children elements and attributes actually exist using Xpath’s evaluate.
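To illustrate the point about existence checks, here is a minimal sketch of DOMXPath::evaluate() used with a context node. The sample markup and the $report array are made up for this illustration and are not part of the question. Note the leading '.' in each expression: without it, a path starting with '//' is resolved against the whole document rather than the current <figure>.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<figure class="a"><img src="/a.jpg"></figure>'
      . '<figure class="b"><img src="/b.jpg" class="wide"><figcaption>Caption B</figcaption></figure>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$report = array();

foreach ($xp->query('//figure') as $figure) {
    $report[] = array(
        // count() is resolved relative to $figure because the path starts with '.'
        'captions'    => (int) $xp->evaluate('count(.//figcaption)', $figure),
        // boolean() tells us whether the attribute node exists at all
        'hasImgClass' => $xp->evaluate('boolean(.//img/@class)', $figure),
        // string() gives an empty string when nothing matches
        'caption'     => $xp->evaluate('string(.//figcaption)', $figure),
    );
}

print_r($report);
```

With plain DOMDocument the same checks are possible, but they take a getElementsByTagName() call plus a null test for each element, so XPath mainly saves typing once the conditions get more specific.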
2
$doc = new \DOMDocument();
// Collect libxml warnings about tags such as <figure> instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$figure = $doc->getElementsByTagName("figure"); // DOMNodeList object

// Create an array to hold the values of each DOMElement
$figures = array();
$i = 0;
foreach ($figure as $item) { // DOMElement object

    $figures[$i]['class'] = $item->getAttribute('class');

    // DOMElement::getElementsByTagName returns a DOMNodeList; item(0) is the first match or null
    $img = $item->getElementsByTagName('img')->item(0);

    if ($img) {
        // DOMElement::getAttribute returns the attribute value, or an empty string if it is missing
        $figures[$i]['img']['src'] = $img->getAttribute('src');
        $figures[$i]['img']['alt'] = $img->getAttribute('alt');
        $figures[$i]['img']['class'] = $img->getAttribute('class');
    }

    // textContent holds the text of the element; only read it if a <figcaption> exists
    $figcaption = $item->getElementsByTagName('figcaption')->item(0);
    if ($figcaption) {
        $figures[$i]['figcaption'] = $figcaption->textContent;
    }

    $i++;
}

echo "<pre>";
print_r($figures);
echo "</pre>";
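If the result has to match the array in the question exactly, with null for every missing attribute or element, something along these lines might work. This is only a sketch: attrOrNull() and extractFigures() are hypothetical helper names introduced here, and the nullable type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: the attribute value, or null when the element or the attribute is absent.
function attrOrNull(?DOMElement $element, string $name): ?string
{
    return ($element !== null && $element->hasAttribute($name))
        ? $element->getAttribute($name)
        : null;
}

// Hypothetical helper: builds one array entry per <figure> found in $html.
function extractFigures(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $figures = array();
    foreach ($doc->getElementsByTagName('figure') as $figure) {
        // item(0) is the first matching child element, or null if there is none
        $img = $figure->getElementsByTagName('img')->item(0);
        $figcaption = $figure->getElementsByTagName('figcaption')->item(0);

        $figures[] = array(
            'class' => attrOrNull($figure, 'class'),
            'img' => array(
                'src' => attrOrNull($img, 'src'),
                'alt' => attrOrNull($img, 'alt'),
                'class' => attrOrNull($img, 'class'),
            ),
            'figcaption' => $figcaption !== null ? $figcaption->textContent : null,
        );
    }

    return $figures;
}

// The sample string from the question
$html = 'Here is some <a href="#">text</a> which I do not need to extract but then there are <figure class="class-one"><img src="/example.jpg" alt="example alt" class="some-image-class"><figcaption>example caption</figcaption></figure>And another one (and many more)<figure class="class-one some-other-class"><img src="/example2.jpg" alt="example2 alt"></figure>';

$figures = extractFigures($html);
print_r($figures);
```

Appending with $figures[] removes the need for a separate counter, and building each entry in one array literal guarantees that every key is present even when the element is missing.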

2 Comments

I get an error, "Trying to get property 'textContent' of non-object", because there is no check that the element actually exists.
Yes, that was a warning. I have added a condition to check the object.
