php preg_replace does not recognize the pattern

Question

I am trying to write a so-called text cleaner so that I can get rid of a few HTML elements without using the strip_tags() function.

My regex looks like this: <em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>

My code looks like this:

$string = "some very messy string here ";
$pattern = '<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>';
$replace = ' ';

$clean =  preg_replace($pattern, $replace, $string);

echo $clean;

For reasons that are beyond my understanding, the echo prints nothing.

Thank you for your time

UPDATE #1

If you are asking if I want to get rid of the tables with all the content inside them the answer is yes.

What is the objective of this code? Why do you want to avoid using strip_tags?
– AD7six, Commented Oct 13, 2012 at 14:52
strip_tags would not delete the content of tables, which I would like to do.
– Mike, Commented Oct 13, 2012 at 14:55
You're better off not using a regex to pseudo-parse HTML. strip_tags will strip tags, and if you want to remove tables, write a routine to remove tables. You're going to get weird results with e.g. "<table>...<table>...</table>...</table>".
– AD7six, Commented Oct 13, 2012 at 15:00
He would have to run the replacement multiple times to get rid of nested tables.
– Martin Ender, Commented Oct 13, 2012 at 15:02
@m.buettner That wouldn't work: after running it the first time, the input string would be "before table string...</table>after table string". There would be no <table> left to match, so a subsequent pass would not remove it. Relevant: stackoverflow.com/a/1732454/761202
– AD7six, Commented Oct 13, 2012 at 15:05

Martin Ender · Accepted Answer · 2012-10-13 15:13:21Z

4

Your regular expression needs delimiters. For example:

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~';

Read up on delimiters in the PCRE pattern syntax section of the PHP manual.
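To see why the original echo printed nothing, here is a minimal sketch with a shortened pattern and a made-up input string (both are assumptions for illustration). Without explicit delimiters, PHP takes the leading < as the opening delimiter and the first > as its closing partner, so the pattern is just "em" and the | that follows is read as an unknown modifier. preg_replace() then raises a warning and returns NULL, and echoing NULL prints nothing.

```php
<?php
// Hypothetical input, shortened pattern: only for illustration.
$string = 'a<em>b</em>c';

// No delimiters: "<" ... ">" are taken as the delimiter pair, "|" is an unknown modifier.
// The @ only hides the warning so the NULL return value is easy to see.
$bad = @preg_replace('<em>|</em>', ' ', $string);
var_dump($bad);  // NULL

// Same alternation wrapped in ~ delimiters.
$good = preg_replace('~<em>|</em>~', ' ', $string);
var_dump($good); // string(5) "a b c"
```

Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter; ~ is convenient here because it does not occur in the pattern, so the / in closing tags needs no escaping.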

Also note that some HTML specifications (all but XHTML as far as I know) allow uppercase tags, too. So consider adding the modifier for case-insensitivity (i) to your regular expression. Furthermore, removing tables might not work if there are line breaks between the opening and closing tags (because . does not match line breaks by default). Add the DOTALL modifier s to solve this:

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~is';
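A small sketch of what the two modifiers change, using only the table part of the pattern and a made-up input with uppercase tags and line breaks (the input is an assumption, not the asker's data):

```php
<?php
// Hypothetical input: uppercase tags, and line breaks inside the table.
$html = "before<TABLE>\n<tr><td>cell</td></tr>\n</TABLE>after";

// No modifiers: "<table" does not match "<TABLE", so nothing is replaced.
$none = preg_replace('~<table[^>]*>(.*?)</table[^>]*>~', ' ', $html);

// Only i: the opening tag matches, but . cannot cross the line breaks.
$caseOnly = preg_replace('~<table[^>]*>(.*?)</table[^>]*>~i', ' ', $html);

// i and s: case-insensitive, and . also matches line breaks.
$both = preg_replace('~<table[^>]*>(.*?)</table[^>]*>~is', ' ', $html);

var_dump($none === $html);     // bool(true)
var_dump($caseOnly === $html); // bool(true)
var_dump($both);               // string(12) "before after"
```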

One final note: as the others pointed out, regex solutions to HTML problems should be taken with a grain of salt. Nested tables will cause issues, as will comments. If you know the data you are dealing with very well, the problem might be much less complex than general HTML. But be sure your code is at least valid and that you know about all oddities like nested structures, HTML characters in comments, and so on.
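The nested-table problem can be reproduced with a short sketch. The input string is made up. The function removeTables() is a hypothetical helper, not part of the original answer: it swaps in PHP's DOM extension instead of a regex, assumes ext-dom is installed, and assumes the input is ASCII or otherwise safe for DOMDocument::loadHTML's default encoding handling.

```php
<?php
// Hypothetical input with a table nested inside another table.
$html = 'A<table><tr><td>outer<table><tr><td>inner</td></tr></table>tail</td></tr></table>B';

// The lazy .*? stops at the FIRST </table>, so the outer table's tail is left behind.
$regexResult = preg_replace('~<table[^>]*>(.*?)</table[^>]*>~is', ' ', $html);
var_dump($regexResult); // string(24) "A tail</td></tr></table>B"

// Hypothetical helper: parse the fragment and drop every <table> element.
function removeTables(string $fragment): string
{
    $doc = new DOMDocument();
    // Wrap in a div so the fragment has a single root element.
    // The @ hides libxml warnings about messy markup.
    @$doc->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    // Copy the live node list first, because removing nodes changes it.
    foreach (iterator_to_array($doc->getElementsByTagName('table')) as $table) {
        if ($table->parentNode !== null) {
            $table->parentNode->removeChild($table);
        }
    }

    // Serialize the children of the wrapper div, not the wrapper itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$domResult = removeTables($html);
var_dump(stripos($domResult, '<table') === false);
```

Whether the parser route is worth it depends on how uniform the data really is. For a one-off cleanup of known input the regex may be enough, as the answer says.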

edited Oct 13, 2012 at 15:13

answered Oct 13, 2012 at 14:52

Martin Ender


2 Comments

Mike Over a year ago

That did it but I think something is broken in the definition of the regex because it does not remove tables.

Martin Ender Over a year ago

. does not match line breaks by default. Add another modifier after the i: s. It's called the DOTALL modifier, and with it the dot will also match line breaks. I'll add it to the answer.

aefxx · Answer · 2012-10-13 14:59 (edited by Community 2017-05-23 12:04:40Z)

3

First of all, have a look at this answer. This should set things straight from the beginning. If, after you've read the answer, you still want to proceed, I give you the following:

I want to <em<p>>emphasize</<p>em> that it's not possible!

Try to clean that!
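For reference, this is what the delimited pattern from the accepted answer does to that sentence. It is a sketch only; the variable names are made up.

```php
<?php
$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~is';
$input   = "I want to <em<p>>emphasize</<p>em> that it's not possible!";

// First pass: only the two inner <p> fragments match and become spaces.
$once = preg_replace($pattern, ' ', $input);
echo $once, PHP_EOL;
// I want to <em >emphasize</ em> that it's not possible!

// Second pass: "<em >" and "</ em>" match none of the alternatives, so nothing changes.
$twice = preg_replace($pattern, ' ', $once);
var_dump($twice === $once); // bool(true)
```

The leftover <em > is still an opening em tag as far as an HTML parser is concerned, so the cleaner has produced a tag it was supposed to remove.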

edited May 23, 2017 at 12:04

CommunityBot


answered Oct 13, 2012 at 14:59

aefxx


8 Comments

Martin Ender Over a year ago

Technically he is not trying to parse it. Also, is this even valid HTML? If so, what would the semantics of this be? Lastly, you could probably solve it by asserting that there is also no opening < before the closing >, and then running the replacement multiple times.

Mike Over a year ago

Could not agree more with that! But here the data looks quite uniform and I have to choose between this regex or clean some 5000 articles by hand, which would not be clever or effective.

aefxx Over a year ago

@m.buettner Did you even read the link I've posted? I don't care whether it is valid HTML; it's not the client's (nor a hacker's) responsibility to provide valid HTML. Go on, come up with a regex that catches my sentences and I'll get back to you with an even more complex one, hrhrhr.

Martin Ender Over a year ago

@aefxx I have read and posted that link myself several dozen times. And writing a regex that can also catch strings that are not valid for the set problem is rarely possible, is it? I totally agree with you that HTML is too complex for regular expressions, but sometimes they still get the job done.

aefxx Over a year ago

@m.buettner I'm ruling it out that drastically because it is the wrong tool for the job. Even if it fixes his problem superficially, it still tears a hole in his application.
