I need a regex that strips the outer (top-level) HTML tag from a string but keeps the inner tags.

$str = "<div>Start <br /> <span>test</span> end.</div>";

Into

$str = "Start <br /> <span>test</span> end.";

As well as

$str = '<aside id="main" class="one">Start <br /> <span>test</span> end.</aside>';

Into

$str = "Start <br /> <span>test</span> end.";

preg_replace('/<[^>]*>/', '', $str);

This removes all tags, not just the outer ones.

  • Better to use DOM than a hacky regexp Commented Feb 23, 2015 at 9:52
  • DOM cannot select the contents of a tag without stripping the tags that are present in the content. It can, however, select an entire tag with its contents, like <div>bla <br> bla</div>, and now I just need to strip the actual outer tag (div in this case) and keep the content with its tags. Commented Feb 23, 2015 at 17:02
  • $html = $domElement->ownerDocument->saveHTML($domElement); should return the content of the Dom node in $html without stripping the tags within it Commented Feb 23, 2015 at 17:07
  • The question is not whether DOM is better than regex, but how to do it with a regex... There are valid reasons to use regex instead of DOM; one major advantage of regex is much faster performance than DOM (see here blog.futtta.be/2014/05/01/…) Commented Apr 2, 2015 at 15:17
  • 1000 views, and only a downvote. Bravo SO, this place used to be positive. What happened? no answer, just some people thinking highly of themselves Commented Sep 14, 2017 at 17:50

4 Answers

Please note

Using a regex is not the best way to modify HTML code! In most situations it is better and much more reliable to use a DOMDocument or DOMDocumentFragment object to modify or extract data from HTML code.

However, there are valid scenarios where a regex is better, mainly when these factors apply:

  • You know that the HTML code that you edit is going to be valid.
  • The HTML structure that is modified will be identical in all cases.
  • You're doing only very simple changes to the code.
  • Performance is important (e.g. when it is executed inside a loop). DOMDocument is considerably slower than a simple regex!

The code

To strip the outermost tag from some HTML code use this regex:

/* Note: 
 * The code must start with an opening tag and end with a closing tag. 
 * No white space or other text must be present before the first 
 * tag/after the last tag, else you get some unexpected results.
 */

$contents = preg_replace( '/^<[^>]+>|<\/[^>]+>$/', '', $markup );
            // ^<[^>]+>     This removes the first tag
            // <\/[^>]+>$   This removes the last closing tag

Examples

This regex works for most HTML markup e.g.

In: '<div class="my-text" id="text" style="color:red">some text</div>'
Out: 'some text' (expected result)

When an attribute of the first tag contains the ">" character, the match ends too early, e.g.

In: '<div title="Home > Archives">Archive overview</div>'
Out: ' Archives">Archive overview' (unexpected result)

Whitespace or text before the first tag or after the last tag also breaks the regex, e.g.

In: '<div>Your name</div>:'
Out: 'Your name</div>:' (unexpected result)

And of course, the first tag and the last closing tag are stripped without any check that they belong together, e.g.

In: '<h2>Settings</h2><label>Page Title</label>'
Out: 'Settings</h2><label>Page Title' (unexpected result)
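
Two of the failure cases above can be softened with a slightly stricter pattern. The following is only a sketch, not part of the original answer: the function name stripOuterTagRegex is made up for illustration, and the pattern assumes the string holds exactly one outer element. It tolerates surrounding whitespace and only strips when the closing tag name equals the opening tag name. It still breaks on ">" inside attribute values, and it cannot tell '<div>a</div><div>b</div>' apart from a single wrapping div.

```php
<?php
// Hypothetical helper (name and pattern are illustrative assumptions).
// Strips the outer tag only if the first tag name and the last closing
// tag name match; otherwise the input is returned unchanged.
function stripOuterTagRegex(string $markup): string {
    // ^\s*<(name)[^>]*>   opening tag, tag name captured in group 1
    // (.*)                inner content, captured in group 2
    // </\1\s*>\s*$        closing tag with the same name, at the very end
    $pattern = '~^\s*<([a-z][a-z0-9]*)\b[^>]*>(.*)</\1\s*>\s*$~is';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '$2', $markup) ?? $markup;
}

echo stripOuterTagRegex('<div>Start <br /> <span>test</span> end.</div>'), PHP_EOL;
echo stripOuterTagRegex('<h2>Settings</h2><label>Page Title</label>'), PHP_EOL;
```

With this pattern the mismatched '<h2>…</label>' input and the '<div>Your name</div>:' input are left untouched instead of being half-stripped.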
How to take a DOM element, and simulate innerHTML()

$html = '<html><body><div><ul><li>1</li><li>2</li><li>3</li></ul></div></body></html>';

function DOMinnerHTML(DOMNode $element) { 
    $innerHTML = "";
    foreach ($element->childNodes as $child) { 
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }

    return $innerHTML; 
} 

$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('ul') as $child) {
    $html = DOMinnerHTML($child); 
    echo $html, PHP_EOL;
}

without having to resort to regexp
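
The same idea can be applied when the outer tag name is not known in advance. The sketch below is an assumption-laden variation, not the answerer's code: the function name innerHtmlOfOuterElement is hypothetical, and it relies on loadHTML() wrapping a fragment in html/body so that the first element child of body is the outer tag.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner HTML
// of the first top-level element of a fragment, whatever its tag name is.
function innerHtmlOfOuterElement(string $markup): string {
    if (trim($markup) === '') {
        return $markup; // nothing to parse
    }

    // Collect parser warnings (e.g. for HTML5 tags such as <aside>)
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $markup;
    }

    // The first element child of <body> is treated as the outer tag.
    foreach ($body->childNodes as $node) {
        if ($node instanceof DOMElement) {
            $inner = '';
            foreach ($node->childNodes as $child) {
                $inner .= $doc->saveHTML($child);
            }
            return $inner;
        }
    }

    return $markup; // no element found, leave the input unchanged
}

echo innerHtmlOfOuterElement('<div>Start <span>test</span> end.</div>'), PHP_EOL;
```

Note that saveHTML() re-serializes the markup, so the output may be normalized (for example, <br /> is written as <br>), and non-ASCII input may need an explicit encoding hint.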

1 Comment

Any way to do this without knowing the parent tagname?

This basic regex will probably do. It does not, however, account for tags whose attributes contain a > character, and will fail on those.

Find: <[^>]*>([\s\S]*)<\/[^>]*>
Replace: $1

It gets more complex if you expect attributes may contain tag brackets.

Find: <(?:[^>]*?(?:(?:"[^"]*?"|'[^']*?')+[^>]*?)|[\s\S]*?)>([\s\S]*)<\/[^>]*>
Replace: $1

Either one should do the trick.
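
In PHP the Find/Replace pair has to be wrapped in pattern delimiters before it is passed to preg_replace(). A minimal sketch of the first pattern, assuming ~ as the delimiter so that the forward slash needs no escaping (the variable names are illustrative):

```php
<?php
$str = '<div>Start <br /> <span>test</span> end.</div>';

// "~" is used as the delimiter here. If the pattern is passed without any
// delimiter, PHP reads the leading "<" and the first ">" as a delimiter
// pair, which is a likely cause of an "Unknown modifier ']'" warning.
$pattern = '~<[^>]*>([\s\S]*)</[^>]*>~';

// "$1" is the content between the first tag and the last closing tag.
$result = preg_replace($pattern, '$1', $str);

echo $result, PHP_EOL;
```

The pattern is not anchored, so any text before the first tag or after the last closing tag is kept in the result.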

2 Comments

Getting error Warning: preg_replace(): Unknown modifier ']' on first regex.
@RomanToasov Try escaping the forward slash. <[^>]*>([\s\S]*)<\/[^>]*>

I made a function that removes the HTML tags along with their contents:

Function:

<?php
function strip_tags_content($text, $tags = '', $invert = FALSE) {

  preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
  $tags = array_unique($tags[1]);

  if(is_array($tags) AND count($tags) > 0) {
    if($invert == FALSE) {
      return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    else {
      return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
    }
  }
  elseif($invert == FALSE) {
    return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
  }
  return $text;
}
?>

Sample text: $text = '<b>sample</b> text with <div>tags</div>';

Result for strip_tags($text): sample text with tags

Result for strip_tags_content($text): text with

Result for strip_tags_content($text, '<b>'): <b>sample</b> text with

Result for strip_tags_content($text, '<b>', TRUE): text with <div>tags</div>

I hope someone finds this useful :)