I am trying to make a so-called text cleaner so that I can get rid of a few HTML elements without using the strip_tags() function.

My regex looks like this: <em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>

My code looks like this:

$string = "some very messy string here ";
$pattern = '<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>';
$replace = ' ';

$clean =  preg_replace($pattern, $replace, $string);

echo $clean;

For reasons beyond my understanding, the echo outputs nothing.

Thank you for your time

UPDATE #1

If you are asking whether I want to get rid of the tables along with all the content inside them, the answer is yes.

  • what is the objective of this code - why do you want to avoid using strip_tags? Commented Oct 13, 2012 at 14:52
  • Strip tags would not delete the content of tables which I would like to do. Commented Oct 13, 2012 at 14:55
  • You're better off not using a regex to pseudo-parse HTML. strip_tags will strip tags, and if you want to remove tables, write a routine to remove tables. You're going to get weird results with e.g.: "<table>...<table>...</table>...</table>". Commented Oct 13, 2012 at 15:00
  • He would have to run the replacement multiple times to get rid of nested tables. Commented Oct 13, 2012 at 15:02
  • @m.buettner That wouldn't work: after the first pass the input string would be "before table string...</table>after table string", so there would be no <table> left to match and a subsequent pass would not remove it. Relevant: stackoverflow.com/a/1732454/761202 Commented Oct 13, 2012 at 15:05

2 Answers

4

Your regular expression needs delimiters. For example:

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~';

Read up on delimiters here.
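
As a rough illustration of why the original call printed nothing: PHP also accepts bracket-style delimiter pairs such as < and >, so a pattern that begins with <em> is read as the regex "em" followed by modifiers, and the | that comes next is not a valid modifier. preg_replace() then raises a warning and returns NULL, which echo prints as an empty string. The input string below is made up for the demonstration.

```php
<?php
// Made-up input; the question does not show its real data.
$input = 'a <em>b</em> c';

// No explicit delimiters: "<" and ">" are taken as the delimiter pair,
// so "|</em>" is parsed as modifiers and compilation fails.
$result = @preg_replace('<em>|</em>', ' ', $input); // @ only hides the warning here
var_dump($result); // NULL

// Same alternatives wrapped in ~ delimiters: the pattern compiles.
$fixed = preg_replace('~<em>|</em>~', ' ', $input);
var_dump($fixed);
```

Checking the return value against null before echoing makes this kind of failure visible instead of silent.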

Also note that some HTML specifications (all but XHTML as far as I know) allow uppercase tags, too. So consider adding the modifier for case-insensitivity to your regular expression. Furthermore, removing tables might not work if there are linebreaks between the opening and closing tags (because . does not match line breaks by default). Add the DOTALL modifier s to solve this:

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~is';
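
A minimal sketch of the full pattern with both modifiers in use follows. The sample markup is an assumption, chosen to include an uppercase tag and a table that spans several lines, since the question does not show the real data.

```php
<?php
// Hypothetical sample input with an uppercase tag and a multi-line table.
$string = "<P class=\"x\">Hello&nbsp;<em>world</em></P>\n"
        . "<table>\n<tr><td>drop me</td></tr>\n</table>\n"
        . "<div>done</div>";

// i = case-insensitive, s = DOTALL (the dot also matches line breaks).
$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~is';

$clean = preg_replace($pattern, ' ', $string);

// preg_replace() returns null on failure, so check before printing.
if ($clean === null) {
    echo 'preg_replace failed, error code: ', preg_last_error(), "\n";
} else {
    echo $clean, "\n";
}
```

One thing to keep in mind: <p[^>]*> also matches tags such as <pre> or <param>, because [^>]* happily consumes the rest of the tag name. If that matters for the data, <p\b[^>]*> is a tighter alternative.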

One final note: as the others pointed out, regex solutions to HTML problems should be taken with a grain of salt. Nested tables will cause issues, as will comments. If you know the data you are dealing with very well, the problem might be much less complex than general HTML. But be sure your markup is at least valid, and that you know about all the oddities such as nested structures and HTML characters inside comments.
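
For the nested-table case specifically, a parser-based routine is one alternative to the regex. The sketch below swaps in PHP's DOMDocument for that purpose; the function name removeTables and the sample input are hypothetical, and the fragment is wrapped in a <div> only so that its contents can be read back afterwards.

```php
<?php
// Hypothetical helper, not from the thread: drops every <table> element,
// nested ones included, and returns the remaining markup.
function removeTables(string $html): string
{
    // Collect libxml complaints about sloppy markup instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first; removing nodes while iterating it skips entries.
    foreach (iterator_to_array($doc->getElementsByTagName('table')) as $table) {
        if ($table->parentNode !== null) {
            $table->parentNode->removeChild($table);
        }
    }

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$messy = 'before<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>after';
echo removeTables($messy), "\n";
```

This is only a sketch: input containing non-ASCII characters may need an explicit charset declaration before loading, and badly unbalanced markup can still be rearranged by the parser in surprising ways.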

2 Comments

That did it but I think something is broken in the definition of the regex because it does not remove tables.
The dot (.) does not match line breaks by default. Add another modifier after the i: s. It's called the DOTALL modifier, and with it the dot will also match line breaks. I'll add it to the answer.
3

First of all, have a look at this answer. This should set things straight from the beginning. If, after you've read the answer, you still want to proceed, I give you the following:

I want to <em<p>>emphasize</<p>em> that it's not possible!

Try to clean that!

8 Comments

Technically he is not trying to parse it. Also, is this even valid HTML? If so, what would its semantics be? Lastly, you could probably solve it by asserting that there is no opening < before the closing > and then running the replacement multiple times.
Could not agree more with that! But here the data looks quite uniform and I have to choose between this regex or clean some 5000 articles by hand, which would not be clever or effective.
@m.buettner Did you even read the link I've posted? I don't care whether it is valid HTML; it's not the client's (nor a hacker's) responsibility to provide valid HTML. Go on, come up with a regex that catches my sentences and I'll get back to you with an even more complex one, hrhrhr.
@aefxx I have read and posted that link myself several dozen times. And writing a regex that can also catch strings that are not valid for the set problem is rarely possible, is it? I totally agree with you that HTML is too complex for regular expressions, but sometimes they still get the job done.
@m.buettner I'm ruling it out that drastically because it is the wrong tool for the job. Even if it fixes his problem superficially, it still tears a hole in his application.