Regular expression: remove tags around a specific string

Question

Here is my string:

$str = '<p>Some <a href="#">link</a> with <a href="http://whatever.html?bla">LINK1</a> and <a href="http://whatever.html?bla" target="_blank">LINK2</a> and</p> more html';

I would like to remove the links LINK1 and LINK2 using php to get:

"<p>Some <a href="#">link</a> with and and</p> more html"

Here is what I think is close to what I need:

$find = array("<a(.*)LINK1(.*)</a>", "<a(.*)LINK2(.*)</a>");
$replace = array("", "");
$result=preg_replace("$find","$replace",$str);

This isn't working. I have searched for days and tried many other options but never managed to get this to work as expected. Also, I don't really mind if the text LINK1 and LINK2 still appears, as long as the a tags are removed.

Please refrain from parsing HTML with RegEx as it will drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢. Use an HTML parser instead. — Madara's Ghost
– Madara's Ghost, Commented Jul 28, 2012 at 22:34
Don't use regular expressions to parse HTML. Use a proper HTML parsing module. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See htmlparsing.com/php or this SO thread for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester
– Andy Lester, Commented Aug 29, 2013 at 19:49

alaeus · Accepted Answer · 2012-07-28 12:21:18Z

1

You are very close to a working solution. The problem you are facing is that regular expressions by default try to match as much as possible. The pattern <a(.*)LINK1(.*)</a> will in fact match from the first <a to the last </a> in the string if LINK1 appears in between. What you want is to match only the <a> tag around LINK1.

There are a few ways to do this, but I usually make the matching ungreedy, so that each quantifier matches as little as possible. There are two ways of doing this: append a ? to the quantifier, or use the ungreedy modifier U. I prefer the first one.

Using ?:

/<a(.*?)LINK1(.*?)<\/a>/

Using modifier:

/<a(.*)LINK1(.*)<\/a>/U
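To make the difference between the greedy pattern and the two ungreedy variants concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and LINK1 is deliberately the first link in the string.

```php
<?php
// Illustrative sample (not the string from the question): LINK1 is the first link.
$sample = '<a href="x">LINK1</a> and <a href="y">other</a>';

// Greedy: the second (.*) runs on to the LAST </a> in the string.
preg_match('/<a(.*)LINK1(.*)<\/a>/', $sample, $greedy);

// Lazy quantifiers: each (.*?) stops as early as possible.
preg_match('/<a(.*?)LINK1(.*?)<\/a>/', $sample, $lazy);

// U modifier: inverts the default greediness of every quantifier in the pattern.
preg_match('/<a(.*)LINK1(.*)<\/a>/U', $sample, $ungreedy);

echo $greedy[0], "\n";   // the whole sample string
echo $lazy[0], "\n";     // <a href="x">LINK1</a>
echo $ungreedy[0], "\n"; // <a href="x">LINK1</a>
```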

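Both should work equally well here. Note that PHP's preg functions need delimiters around the pattern (here /), and that the arrays must be passed as they are, not inside quotes as in "$find". Keep in mind that an ungreedy match still starts at the first <a in the string, so a link that comes before LINK1 gets removed along with it. The entire source code will thus be as follows (using ?):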

$find = array("/<a(.*?)LINK1(.*?)<\/a>/", "/<a(.*?)LINK2(.*?)<\/a>/");
$replace = array("", "");
$result = preg_replace($find, $replace, $str);
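One thing to watch for: a lazy quantifier only shortens where the match ends; the regex engine still begins at the leftmost <a. The sketch below runs the lazy patterns on the string from the question, assuming the two links to remove are labelled LINK1 and LINK2, and compares them with a possible tightening that uses [^>]* so the match cannot leave the opening tag. The \b, the alternation, and the assumption that the link text is exactly LINK1 or LINK2 are additions for this example, not part of the answer above.

```php
<?php
// String from the question, assuming the links to remove are labelled LINK1 and LINK2.
$str = '<p>Some <a href="#">link</a> with <a href="http://whatever.html?bla">LINK1</a> and <a href="http://whatever.html?bla" target="_blank">LINK2</a> and</p> more html';

// Lazy patterns from above: the first match starts at the first "<a",
// so the earlier "link" anchor is removed as well.
$lazy = preg_replace(
    array('/<a(.*?)LINK1(.*?)<\/a>/', '/<a(.*?)LINK2(.*?)<\/a>/'),
    '',
    $str
);

// Tighter sketch: [^>]* cannot cross the end of the opening tag,
// and the link text must be exactly LINK1 or LINK2.
$tight = preg_replace('/<a\b[^>]*>(?:LINK1|LINK2)<\/a>/', '', $str);

echo $lazy, "\n";  // <p>Some  and  and</p> more html
echo $tight, "\n"; // <p>Some <a href="#">link</a> with  and  and</p> more html
```

The tighter pattern still assumes simple markup (no > inside attribute values, no nested tags in the link text), so it remains a regex shortcut for trusted input rather than a real HTML parser.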

And yeah, as noted in the comments, you shouldn't rely on regular expressions for manipulating HTML code (because it is really easy to construct valid HTML that slips past the expression unnoticed). However, I believe it is perfectly OK if you trust the HTML you are processing, or if the result of the matching is not crucial for other important functions.

edited Jul 28, 2012 at 12:21

answered Jul 28, 2012 at 12:15

5 Comments

Anthony G Over a year ago

Thank you so much for your help and detailed explanation! This seems to work well, but you and Lix are saying that I shouldn't use regular expressions, so I'm going to look into DOM parsers. Hopefully it won't be much harder :)

alaeus Over a year ago

It all depends on how you use it. Bad usage: using it to remove unwanted content from text coming from web visitors (like a filtering system for blog comments). OK usage: using it on HTML code that you have written yourself (or that comes from another source that cannot plausibly intend to hack you). Another semi-OK usage: scanning through another web page for stuff.

Anthony G Over a year ago

OK Alaeus, my content comes from trusted sources only, so I should be able to use regexps then! Thank you for your comment. Do you guys also know how I could match links that contain "@" and numbers such as "1"?

alaeus Over a year ago

I'm not sure I understand. /<a(.*?)>[@\d]+<\/a>/ will match links whose text consists only of @ and digits. Is that what you were after?

Anthony G Over a year ago

Sorry Alaeus, I just wanted to remove an email address, so this does the trick: $find = '[email protected]'. My question was stupid; I wasn't using quotes, that's why I had an error...

heximal · Answer · 2012-07-28 12:05:49Z

0

try this:

<?php
$str = '<p>Some <a href="#">link</a> with <a href="http://whatever.html?bla">LINK1</a> and <a href="http://whatever.html?bla" target="_blank">LINK2</a> and</p> more html';
$find = array("/<a(.*)LINK1(.*)<\/a>/si", "/<a(.*)LINK2(.*)<\/a>/si");
$replace = array("", "");
$result=preg_replace($find, $replace, $str);

answered Jul 28, 2012 at 12:05

2 Comments

Anthony G Over a year ago

Thanks for your reply, unfortunately this seems to replace much more than just the link.

Lix Over a year ago

Parsing HTML content with regular expressions is widely regarded as a bad idea. XML or DOM parsers would be a much better choice.
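As a rough illustration of the DOM approach suggested here, the sketch below uses PHP's bundled DOMDocument and DOMXPath classes. The function name removeLinksByText, the wrapper div and its id are hypothetical choices for this example, and the input is assumed to be the string from the question with the links to remove labelled LINK1 and LINK2.

```php
<?php
// Hypothetical helper: removes every <a> element whose text equals one of $labels.
function removeLinksByText(string $html, array $labels): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so it has one known root element; the id is arbitrary.
    // Note: loadHTML assumes ISO-8859-1 unless told otherwise, which is fine
    // for this ASCII-only sample.
    $dom->loadHTML('<div id="sketch-wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//a')) as $a) {
        if (in_array(trim($a->textContent), $labels, true)) {
            $a->parentNode->removeChild($a);
        }
    }

    // Serialize only the children of the wrapper, not the implied html/body.
    $wrap = $xpath->query('//div[@id="sketch-wrap"]')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$str = '<p>Some <a href="#">link</a> with <a href="http://whatever.html?bla">LINK1</a> and <a href="http://whatever.html?bla" target="_blank">LINK2</a> and</p> more html';
$result = removeLinksByText($str, array('LINK1', 'LINK2'));
echo $result, "\n";
```

Because the comparison is done on each element's text rather than on raw markup, attribute order, extra attributes, or other links earlier in the string do not affect which anchors are removed.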
