Extract img tags from an HTML string with PHP's preg_match_all function

Question

I have some HTML code and want to extract the img src attribute from it. The HTML string contains some img tags like this:

<img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png">

I've tried to do this with the following PHP code:

$description = wpautop($this->data->description);
$description = preg_replace("/\[[^\]]+\]/", '', $description);
     if (preg_match_all("<img src=(.*?)>", $description, $match)) {
          echo match;
            };

and the result is NULL.

Can you help me, please?

Do you want just the image link, or the complete <img src="pecso.it/wp-content/uploads/2016/12/10_WRAS.png"> tag?
– Sercan REYHANLI, Commented Dec 27, 2016 at 14:28
@federkun can you write the complete text of the function call, please?
– marco baldini, Commented Dec 27, 2016 at 14:33

cb0 · Accepted Answer · 2016-12-27 16:25:19Z

Do not use regex on html!

Use a DOM parser instead; it is much less of a hassle.

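// Load the markup to inspect (a file here; any HTML string works).
$html = file_get_contents("your_file.html");

$dom = new \DOMDocument();
// Must be set before loading to have any effect.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

// Collect the src attribute of every <img> element.
$images = [];
foreach ($dom->getElementsByTagName('img') as $image) {
    $images[] = $image->getAttribute('src');
}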
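
Since the question works with an HTML string rather than a file, the same approach can be applied to the string directly. This is only a minimal sketch: the helper name extractImageSources and the sample $html are made up for illustration, and libxml_use_internal_errors is used just to keep warnings about imperfect markup quiet.

```php
<?php
// Hypothetical helper (not from the original answer): returns the src
// attribute of every <img> element found in an HTML string.
function extractImageSources($html)
{
    // loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $sources[] = $image->getAttribute('src');
    }
    return $sources;
}

// Illustrative input modelled on the tag from the question.
$html = '<p>Logo: <img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png"></p>';
print_r(extractImageSources($html));
```

In the question's code, the cleaned-up $description would take the place of $html.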

Edit:

You are using the wpautop function to clean up the description. According to the documentation, it requires the text to be formatted as its first argument. So first make sure that it preserves the image tags in that text.

I will assume that the tags are preserved. Looking at the regex itself, I see that it is matching too little.

You are matching .*? inside the capturing group. The .* matches any character, zero or more times, and the trailing ? makes the match lazy, which means it consumes as few characters as needed.

In my var_dump output for $match, I see that it found a match.

array (size=2)
  0 => 
    array (size=1)
      0 => string 'img src=' (length=8)
  1 => 
    array (size=1)
      0 => string '' (length=0)

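However, the first capturing group has length 0 because of the lazy matching. You would expect it to match everything up to >, since that looks like part of the regex. This is not a PHP bug, though: PHP accepts < and > as a pair of pattern delimiters, so the pattern that actually runs is img src=(.*?). Nothing follows the lazy group, so it is free to match zero characters.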

If you change the capturing group to .+?, the first group will contain a single " character, because + means "one or more" characters.
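
The behaviour described above can be reproduced in isolation. In this sketch (the sample $html and the variable names are illustrative), the first call uses the pattern exactly as written in the question, where PHP takes the leading < and trailing > as a bracket-style delimiter pair. The second call wraps the same text in explicit / delimiters, so < and > become literal characters.

```php
<?php
// Illustrative input modelled on the tag from the question.
$html = '<img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png">';

// As written in the question: < and > are consumed as delimiters,
// so the regex that runs is:  img src=(.*?)
preg_match_all("<img src=(.*?)>", $html, $asWritten);

// With explicit / delimiters the regex that runs is:  <img src=(.*?)>
preg_match_all("/<img src=(.*?)>/", $html, $withDelimiters);

var_dump($asWritten[0][0]);      // whole match ends right after "src="
var_dump($asWritten[1][0]);      // the lazy group captures nothing
var_dump($withDelimiters[1][0]); // the group runs up to the closing >, quotes included
```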

One solution is to change the pattern so that it includes the quotation marks:

if (preg_match_all("<img src=\"(.*?)\">", $description, $match)) {

This matches the desired image link:

http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png
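
For completeness, here is a small sketch of how the captured links could be read back after the call. The sample $description and the $links variable are assumptions for illustration. Note that preg_match_all fills $match with arrays, so the links are in $match[1] and cannot be printed with a plain echo of the array.

```php
<?php
// Illustrative input; in the question this would be the cleaned-up description.
$description = '<p><img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png"></p>';

$links = [];
if (preg_match_all("<img src=\"(.*?)\">", $description, $match)) {
    // $match[0] holds the full matches, $match[1] the captured links.
    $links = $match[1];
}

foreach ($links as $link) {
    echo $link, PHP_EOL;
}
```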

I would recommend using the DOMDocument approach, as that code is likely to be more stable and extensible. If you want to learn regex, parsing HTML might not be the best place to start.

All this code was tested using PHP 5.4; it might behave differently in newer versions!
