I have some HTML code and want to extract the img src attributes from it. The HTML string contains img tags like this:

<img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png">

I've tried to do this with the following PHP code:

$description = wpautop($this->data->description);
$description = preg_replace("/\[[^\]]+\]/", '', $description);
     if (preg_match_all("<img src=(.*?)>", $description, $match)) {
          echo match;
            };

and the result is NULL.

Can you help me, please?

  • Use a DOM parser instead. Commented Dec 27, 2016 at 14:25
  • /<img src=\"(.*?)\">/ Commented Dec 27, 2016 at 14:27
  • Do you want just the image link, or the complete <img src="pecso.it/wp-content/uploads/2016/12/10_WRAS.png"> tag? Commented Dec 27, 2016 at 14:28
  • @ivan I want to take src of every img tag Commented Dec 27, 2016 at 14:30
  • @federkun can you write the complete text of the call function, please? Commented Dec 27, 2016 at 14:33

1 Answer

Do not use regex on HTML!

Use a DOM parser instead, as it is much less hassle.

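// Load the HTML to parse ("your_file.html" is a placeholder path).
$html = file_get_contents("your_file.html");

$dom = new \DOMDocument();
// This must be set before loading the document to have any effect.
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

// Collect the src attribute of every <img> element.
$images = [];
foreach ($dom->getElementsByTagName('img') as $image) {
    $images[] = $image->getAttribute('src');
}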
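Since the question works on a string rather than a file, the same idea can be sketched for an in-memory fragment. The helper name extractImageSources and the sample $description are assumptions made for illustration, and the type hints assume PHP 7 or newer; libxml_use_internal_errors() is used only to keep parser warnings about imperfect markup out of the output.

```php
<?php
// Hypothetical helper: returns the src of every <img> in an HTML fragment.
function extractImageSources(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $sources[] = $image->getAttribute('src');
    }
    return $sources;
}

// Sample input standing in for the asker's $description (an assumption).
$description = '<p>Certificate: <img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png"></p>';
print_r(extractImageSources($description));
```

Because the parser reads attributes by name, this should keep working when the img tag carries other attributes or uses single quotes.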

Edit:

You are using the wpautop function to clean up the description. According to the documentation, its first argument is the text to be formatted. So first make sure that it preserves the image tags in that text.

Assuming the tags are preserved, the problem is in the regex itself: it matches too little.

You are matching .*? inside the capturing group. Here .* matches any character, zero or more times, and the trailing ? makes the match lazy, which means it consumes as few characters as needed.

In my output of var_dump for $match I see that it found a match.
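To make the difference concrete, here is a small sketch comparing a greedy and a lazy group on the same input. The two-image sample string is made up for this illustration.

```php
<?php
// Made-up sample with two images (not from the question).
$html = '<img src="a.png"><img src="b.png">';

// Greedy: .* runs on to the last possible closing quote.
preg_match('/src="(.*)"/', $html, $greedy);

// Lazy: .*? stops at the first closing quote.
preg_match('/src="(.*?)"/', $html, $lazy);

echo $greedy[1], "\n"; // a.png"><img src="b.png
echo $lazy[1], "\n";   // a.png
```

A lazy group still needs something after it in the pattern to stop at; with nothing following it, matching zero characters already satisfies it.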

array (size=2)
  0 => 
    array (size=1)
      0 => string 'img src=' (length=8)
  1 => 
    array (size=1)
      0 => string '' (length=0)

However, the first capturing group is empty. This is not a PHP error: PHP treats the first character of the pattern as its delimiter, and < ... > is a valid bracket-style delimiter pair. The regex actually applied is therefore img src=(.*?), without the angle brackets, which is also why the full match is only 'img src='. With nothing after the lazy group, matching zero characters is enough.

If you change the capturing group to .+?, the first group will contain a single " character, because + means "one or more" characters.
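The empty group can be reproduced in isolation. A likely explanation, going by the PHP manual's description of bracket-style delimiters, is that "<" and ">" in the original pattern are consumed as delimiters rather than matched. The sketch below compares that pattern with one using explicit / delimiters; the variable names are chosen for illustration only.

```php
<?php
$html = '<img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png">';

// "<" and ">" act as delimiters here; the effective pattern is: img src=(.*?)
preg_match_all("<img src=(.*?)>", $html, $asDelimiters);

// With explicit "/" delimiters the angle brackets are matched literally.
preg_match_all('/<img src=(.*?)>/', $html, $literal);

var_dump($asDelimiters[0][0]); // the full match is just: img src=
var_dump($asDelimiters[1][0]); // empty string
var_dump($literal[1][0]);      // the URL, still wrapped in its quotation marks
```

With the literal version the lazy group has a closing > to stop at, so it captures everything between src= and the end of the tag.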

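A solution would be to change the pattern so it includes the quotation marks, which gives the lazy group something to stop at. Using explicit / delimiters also keeps the angle brackets part of the pattern:

if (preg_match_all('/<img src="(.*?)">/', $description, $match)) {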

This matches the desired image link:

http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png
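For completeness, a minimal sketch of reading the captured links out of $match. It writes the pattern with explicit / delimiters, and the sample $description is an assumption standing in for the real data.

```php
<?php
// Sample input standing in for $description (an assumption for this sketch).
$description = '<p><img src="http://www.pecso.it/wp-content/uploads/2016/12/10_WRAS.png"></p>';

$sources = [];
if (preg_match_all('/<img src="(.*?)">/', $description, $match)) {
    // $match[0] holds the full matches, $match[1] the first capturing group.
    $sources = $match[1];
}

print_r($sources);
```

Note that this pattern only behaves well when src is the sole attribute: for a tag such as <img src="a.png" alt="x"> the group would run on to the last quote before > and capture more than the URL.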

I would recommend using the DOMDocument approach, as that code is likely to be more stable and extensible. If you want to learn regex, parsing HTML might not be the best place to start.

All this code was tested using PHP 5.4; it might behave differently in newer versions!

1 Comment

DomDocument is the right way when dealing with HTML.
