Update src value using preg_replace

Question

I have some <img> tags like these:

<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />

And I need to update the src value, extracting the filename and adding it inside a WordPress shortcode:

<img src="[my-shortcode file='test.png']" ... />

The regex to extract the filename is this one: [a-zA-Z_0-9-()]+\.[a-zA-Z]{2,4}, but I am not able to create the complete regex, considering that the image tag attributes do not follow the same order in all instances.

It's not clear to me whether you have the HTML containing those <img> tags as plain text and want to produce another text with the processed content, or whether you are aiming at a process that happens at runtime on some platform. If the question is limited to regex, it's not clear what the subject string is. Is it text containing arbitrary HTML, including <img> tags, that you need to process so that each <img> has its src attribute changed?
– Diego D, Commented Jan 24, 2023 at 14:01
@DiegoD The first option: I have a variable containing the HTML as a string and would like to make the changes and save the result in the database.
– marcelo2605, Commented Jan 24, 2023 at 14:04
If I got it correctly and you need to interpret and transform an HTML fragment you have as a string, you should use an HTML parser instead of relying on regular expressions. I made a very simple demo that shows the concept here: onlinephp.io/c/92303. It would be too weak to post as an answer, especially at this stage where the requirements are still blurry. Anyway, once you have the src value you can process it any way you prefer (including with regex).
– Diego D, Commented Jan 24, 2023 at 14:12
Thanks @DiegoD. I tried that before, but $dom->saveHTML() returns the whole HTML document. How can I return only the body content?
– marcelo2605, Commented Jan 24, 2023 at 14:14
onlinephp.io/c/157de I edited the demo so that it now returns the content of the body element (since the DOMDocument object creates an HTML frame to wrap your input code).
– Diego D, Commented Jan 24, 2023 at 14:21

Diego D · Accepted Answer · 2023-01-24 17:06:50Z

PHP - Parsing html contents, making transforms and returning the resulting html

This answer grew over several edits while trying to address the issue.

Several approaches are described below; the last one (loadXML/saveXML) is the one that solved the problem.

DOMDocument - loadHTML and saveHTML

If you need to parse an html string in php so that you can later fetch and modify its content in a structured and safe manner without breaking the encoding, you can use DOMDocument::loadHTML():

https://www.php.net/manual/en/domdocument.loadhtml.php

The code below parses your HTML string, fetches all its <img> elements and, for each of them, reads the src attribute and sets it to an arbitrary value.

At the end, to return the HTML string of the transformed document, you can use DOMDocument::saveHTML():

https://www.php.net/manual/en/domdocument.savehtml

Keep in mind that, by default, the document will contain the basic HTML frame (<html> and <body>) wrapping your original content. To make sure the resulting HTML is limited to your content only, the code fetches the body element and loops through its children to build the final string:

https://onlinephp.io/c/157de

<?php

$html = "
<img alt=\"\" src=\"{assets_8170:{filedir_14}test.png}\" style=\"width: 700px; height: 181px;\" />
<img src=\"{filedir_14}test.png\" alt=\"\" />
";

$transformed = processImages($html);

echo $transformed;

function processImages($html){

    //parse the html fragment
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    
    //fetch the <img> elements
    $images = $dom->getElementsByTagName('img');
    
    //for each <img>
    foreach ($images as $img) {
        //get the src attribute
        $src = $img->getAttribute('src');
        //set the src attribute
        $img->setAttribute('src', 'bogus');
    }
    
    //return the html modified so far (body content only)
    $body = $dom->getElementsByTagName('body')->item(0);
    $bodyChildren = $body->childNodes;
    $bodyContent = '';
    foreach ($bodyChildren as $child) {
        $bodyContent .= $dom->saveHTML($child);
    }
    return $bodyContent;
}
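
The loop above only sets a placeholder value ('bogus'). As a minimal sketch of the missing step, the helper below turns a src value into the shortcode form asked for in the question. The function name buildShortcodeSrc is hypothetical, and the pattern is the one from the question with the hyphen moved to the end of the character class so it cannot be read as a range.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of the original answer):
// extracts the filename from a src value and wraps it in the shortcode.
function buildShortcodeSrc(string $src): string
{
    // Filename pattern from the question, hyphen moved to the end of the class.
    if (preg_match('/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}/', $src, $m) !== 1) {
        return $src; // no filename found: leave the value unchanged
    }
    return "[my-shortcode file='{$m[0]}']";
}

echo buildShortcodeSrc('{assets_8170:{filedir_14}test.png}'), "\n";
echo buildShortcodeSrc('{filedir_14}test.png'), "\n";
```

Inside the foreach, calling $img->setAttribute('src', buildShortcodeSrc($src)) would then replace the placeholder.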

Problems with src attribute value restrictions

After you pointed out in the comments that saveHTML was returning HTML in which the special characters of the image src attribute value were escaped, I did some more research.

This happens because, when serializing as HTML, DOMDocument treats src as a URL attribute and percent-encodes characters that are not valid in a URL, such as { and }.

Evidence that it doesn't happen with custom data attributes

For example, when I added an attribute like data-test="mycustomcontent: {wildlyusingwhatever}", it was returned untouched, because custom data attributes have no strict content rules to adhere to.
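
A small way to check this yourself might look like the following. It is only a sketch: the fragment is made up for illustration, and whether the src value comes back percent-encoded can depend on the libxml version behind your PHP build, so no expected output is shown.

```php
<?php
// Illustrative fragment: the same kind of braces appear in src and in a data-* attribute.
$html = '<img src="{filedir_14}test.png" data-test="mycustomcontent: {wildlyusingwhatever}" alt="">';

$dom = new DOMDocument();
$dom->loadHTML($html);

$img = $dom->getElementsByTagName('img')->item(0);

// The DOM still holds the raw value; any escaping happens when serializing.
echo $img->getAttribute('src'), "\n";

// Serialize only the <img> element and compare the two attributes by eye.
$serialized = $dom->saveHTML($img);
echo $serialized, "\n";
```

Comparing the src and data-test values in the serialized string shows which attributes the HTML serializer rewrites.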

Quick fix to make it work (defeating the parser as a whole)

As a fix for that, all I could come up with so far was this:

https://onlinephp.io/c/0e334

//VERY UNSAFE -- replaces every %7B with { and every %7D with } in $bodyContent
$bodyContent = str_replace("%7B", "{", $bodyContent);
$bodyContent = str_replace("%7D", "}", $bodyContent);
return $bodyContent;

But of course it is neither safe nor smart, and not a very good solution. First, it defeats the whole purpose of using a parser instead of regex; second, it could seriously damage the result, because the replacement is applied to the entire HTML string and not only to the src values.
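
To make the risk concrete, here is a small sketch with a made-up $bodyContent (the link and its query string are assumptions for illustration). The blanket replacement also rewrites an escape sequence that was legitimately part of another URL:

```php
<?php
// Made-up serialized HTML: the link intentionally contains %7B and %7D,
// while the image src was percent-encoded by the serializer.
$bodyContent = '<a href="/search?q=%7Bliteral%7D">x</a>'
             . '<img src="%7Bfiledir_14%7Dtest.png">';

// The quick fix from above, applied to the whole string.
$fixed = str_replace(['%7B', '%7D'], ['{', '}'], $bodyContent);

echo $fixed, "\n";
// prints: <a href="/search?q={literal}">x</a><img src="{filedir_14}test.png">
```

The src is restored, but the href was changed too, which is exactly the kind of collateral damage a parser-based approach is meant to avoid.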

A better approach using loadXML and saveXML

To prevent the HTML rules from kicking in, you can parse the text as XML instead of HTML. The parser still handles the nested markup syntax (difficult or impossible to deal with using regex), but it does not apply HTML's restrictions on attribute contents. The trade-off is that the input must be well-formed XML, for example with self-closing <img ... /> tags as in your samples.

I modified the core logic by doing this:

//loads the html content as xml, wrapping it in a single root element
//({$html} is used because the ${html} interpolation form is deprecated since PHP 8.2)
$dom->loadXML("<root>{$html}</root>");

//...

//returns the xml content of each child of <root> as processed so far
$rootNode = $dom->documentElement;
$children = $rootNode->childNodes;
$content = '';
foreach ($children as $child) {
   $content .= $dom->saveXML($child);
}

return $content;

And this is the working demo: https://onlinephp.io/c/f9de1
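
Putting the pieces together, a self-contained sketch of the XML route applied to the question's goal could look like this. The function names srcToShortcode and processImagesAsXml are hypothetical, the filename pattern is the one from the question (hyphen moved to the end of the class), and the input is assumed to be well-formed XML.

```php
<?php
// Sketch only: function names are assumptions, not part of the original answer.
function srcToShortcode(string $src): string
{
    // Filename pattern from the question, hyphen moved to the end of the class.
    if (preg_match('/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}/', $src, $m) !== 1) {
        return $src; // no filename found: keep the original value
    }
    return "[my-shortcode file='{$m[0]}']";
}

function processImagesAsXml(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment in a single root so it forms a well-formed XML document.
    if (!$dom->loadXML("<root>{$html}</root>")) {
        throw new InvalidArgumentException('Input is not well-formed XML');
    }

    // Rewrite the src attribute of every <img>.
    foreach ($dom->getElementsByTagName('img') as $img) {
        $img->setAttribute('src', srcToShortcode($img->getAttribute('src')));
    }

    // Serialize only the children of <root>, not the wrapper itself.
    $content = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $content .= $dom->saveXML($child);
    }
    return $content;
}

$html = '<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />' . "\n"
      . '<img src="{filedir_14}test.png" alt="" />';

echo processImagesAsXml($html), "\n";
```

With this route each src should come back as [my-shortcode file='test.png'] with its brackets and quotes left as written, since the XML serializer does not apply URL escaping to attributes.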

edited Jan 24, 2023 at 17:06

answered Jan 24, 2023 at 14:26

Diego D

5 Comments

marcelo2605 Over a year ago

Thanks for the help! I just have one last issue: square brackets and double quotes have been converted (%5B, %22). How can I avoid that?

marcelo2605 Over a year ago

Found a solution: urldecode($bodyContent).

Diego D Over a year ago

Yes, I know that one, but it's a fake solution. You are applying it to the whole HTML string, so it's like doing str_replace but on every single escape sequence instead of just %7B and %7D. I didn't mention it because I trust it even less when I can't control what is going to be replaced. It would be effective if it could be applied ONLY to the src attribute value, but by design you can't control the raw serialized value of a single attribute/element from the DOMDocument object until you have the final, entire HTML string.

Diego D Over a year ago

Consider also opening a new question (or editing this one) to explain that using the HTML parser in PHP leaves your src attribute values filled with escape sequences because they contain invalid characters, and to ask how to consistently avoid that while still using a parser strategy.

Diego D Over a year ago

@marcelo2605 In the end I came up with a solution using the XML variant (XML parsing keeps the tag structure without HTML's special handling of attributes such as src). I edited the answer to add the final approach and a working demo.
