I tried to follow some questions here about preg_match and DOM, but everything just flew over my head.

I have a string like this:

$string = '<td class="borderClass" width="225" style="border-width: 0 1px 0 0;" valign="top">
<div style="text-align: center;">
    <a href="http://myanimelist.net/anime/10800/Chihayafuru/pic&pid=35749">
    <img src="http://cdn.myanimelist.net/images/anime/3/35749.jpg" alt="Chihayafuru" align="center">
    </a>
</div>';

I'm now trying to get the image src attribute value from it. I tried using this code, but I can't figure out what I'm doing wrong.

$doc = new DOMDocument();
$dom->loadXML( $string );
$imgs = $dom->query("//img");
for ($i=0; $i < $imgs->length; $i++) {
    $img = $imgs->item($i);
    $src = $img->getAttribute("src");
}
$scraped_img = $src;

How can I get the image src attribute from this using PHP?

Comment: Use DOM. Anything else suggesting regexes or string operations is essentially wrong. (Commented Oct 11, 2013 at 16:59)

3 Answers

Here is the corrected code that you can use:

$string = '<td class="borderClass" width="225" style="border-width: 0 1px 0 0;" valign="top">
<div style="text-align: center;">
    <a href="http://myanimelist.net/anime/10800/Chihayafuru/pic&pid=35749">
    <img src="http://cdn.myanimelist.net/images/anime/3/35749.jpg" alt="Chihayafuru" align="center">
    </a>
</div>';

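$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them
// (the unescaped "&" in the href would otherwise trigger one).
libxml_use_internal_errors(true);
// Parse as HTML: the fragment is not well-formed XML, so loadXML() fails on it.
$doc->loadHTML( $string );
// query() is a method of DOMXPath, not of DOMDocument.
$xpath = new DOMXPath($doc);
$imgs = $xpath->query("//img");
for ($i=0; $i < $imgs->length; $i++) {
    $img = $imgs->item($i);
    // With several images, $src ends up holding the last one.
    $src = $img->getAttribute("src");
}

echo $src;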

OUTPUT

http://cdn.myanimelist.net/images/anime/3/35749.jpg
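
If the markup can contain more than one image, the loop above only keeps the last src. A minimal sketch of a variant that collects all of them follows; the function name extractImageSources() is hypothetical, not part of the answer, and the sample string is a shortened version of the one in the question.

```php
<?php
// Hypothetical helper (an illustration, not from the original answer):
// returns the src of every <img> element that has one.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();

    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = [];
    // "//img[@src]" skips <img> elements without a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Shortened sample markup based on the question.
$string = '<div><a href="http://myanimelist.net/anime/10800/Chihayafuru/pic&pid=35749">'
    . '<img src="http://cdn.myanimelist.net/images/anime/3/35749.jpg" alt="Chihayafuru"></a></div>';

print_r(extractImageSources($string));
```

Restoring the previous libxml_use_internal_errors() value keeps the helper from changing error handling for the rest of the script.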

We have found while writing Drupal that using SimpleXML is much easier than dealing with the DOM:

$htmlDom = new \DOMDocument();
@$htmlDom->loadHTML('<?xml encoding="UTF-8">' . $string);
$elements = simplexml_import_dom($htmlDom);
print $elements->body->td[0]->div[0]->a[0]->img[0]['src'];

This lets you load whatever HTML soup you have, because the DOM parser is more forgiving than SimpleXML, while still letting you use the simple and powerful SimpleXML extension.

The first three lines are copied verbatim out of the Drupal testing framework; it's truly battle-hardened code.
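
The fixed path body->td[0]->div[0]->a[0]->img[0] depends on the exact nesting of the markup. As a hedged alternative, SimpleXMLElement::xpath() can locate the attribute wherever it sits. This is only a sketch: the function name imageSourcesViaSimpleXml() is hypothetical, and it uses libxml_use_internal_errors() in place of the @ operator.

```php
<?php
// Hypothetical helper (an illustration, not Drupal code): load HTML through
// DOMDocument, then query it with SimpleXML's xpath().
function imageSourcesViaSimpleXml(string $html): array
{
    $htmlDom = new DOMDocument();

    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is the same UTF-8 hint used in the answer.
    $htmlDom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = simplexml_import_dom($htmlDom);

    // xpath() returns attribute nodes as SimpleXMLElement objects;
    // cast each one to a plain string.
    $attributes = $root->xpath('//img/@src') ?: [];
    return array_map(
        function ($attribute) {
            return (string) $attribute;
        },
        $attributes
    );
}

print_r(imageSourcesViaSimpleXml(
    '<div><a href="#"><img src="http://cdn.myanimelist.net/images/anime/3/35749.jpg" alt="Chihayafuru"></a></div>'
));
```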

1 Comment

Thank you for your nice answer, @chx. Your code has its great applications, but I believe anubhava's answer is what I was looking for. It's sad that I could upvote your answer only once. :)

    $html = '<td class="borderClass" width="225" style="border-width: 0 1px 0 0;" valign="top">
<div style="text-align: center;">
    <a href="http://myanimelist.net/anime/10800/Chihayafuru/pic&pid=35749">
    <img src="http://cdn.myanimelist.net/images/anime/3/35749.jpg" alt="Chihayafuru" align="center">
    </a>
</div>';

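    $dom = new DOMDocument();
    // Enable internal error handling before parsing so that libxml warnings
    // (e.g. for the unescaped "&" in the href) are not emitted.
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
    libxml_clear_errors();
    libxml_use_internal_errors(FALSE);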
    $xpath = new DOMXPath($dom);
    /** @var \DOMNodeList $images_dom_list */
    $images_dom_list = $xpath->query('//img');
    /** @var \DOMElement $image_dom_element */
    foreach ($images_dom_list as $image_dom_element) {
      $src = $image_dom_element->getAttribute('src');
      // Do what you want.
      $src = '//google.com/image.jpg';
      $image_dom_element->setAttribute('src', $src);
    }

    $updated_html_string = $xpath->document->saveHTML();
