Regex to strip outer HTML tags in string

Question

I need a regex to strip outer/top level HTML tags in a string but keep the internal ones.

$str = "<div>Start <br /> <span>test</span> end.</div>";

Into

$str = "Start <br /> <span>test</span> end.";

As well as

$str = '<aside id="main" class="one">Start <br /> <span>test</span> end.</aside>';

Into

$str = "Start <br /> <span>test</span> end.";

.

preg_replace('/<[^>]*>/', '', $str);

This removes all tags, not just the outer ones.

DOM cannot select the contents of a tag without stripping the tags present in that content. It can, however, select the entire tag with its contents, like <div>bla <br> bla</div>, and now I just need to strip the actual outer tag (div in this case) and keep the content with its tags.
– Roman Toasov, Commented Feb 23, 2015 at 17:02
$html = $domElement->ownerDocument->saveHTML($domElement); should return the content of the DOM node in $html without stripping the tags within it.
– Mark Baker, Commented Feb 23, 2015 at 17:07
The question is not whether DOM is better than regex, but how to do it with a regex... There are valid reasons to use a regex instead of DOM; one major advantage of regex is much faster performance than DOM (see here blog.futtta.be/2014/05/01/…)
– Philipp, Commented Apr 2, 2015 at 15:17
1000 views, and only a downvote. Bravo SO, this place used to be positive. What happened? No answer, just some people thinking highly of themselves.
– Toskan, Commented Sep 14, 2017 at 17:50

Philipp · Accepted Answer · 2015-04-02 15:44:38Z

Please note

Using a regex is not the best way to modify HTML code! In most situations it is better and much more reliable to use a DOMDocument or DOMDocumentFragment object to modify or extract data from HTML code.

However, there are valid scenarios where a regex is better, mainly when these factors apply:

You know that the HTML code that you edit is going to be valid.
The HTML structure that is modified will be identical in all cases.
You're doing only very simple changes to the code.
Performance is important (e.g. when it is executed inside a loop). DOMDocument is considerably slower than a simple regex!

The code

To strip the outermost tag from some HTML code use this regex:

/* Note: 
 * The code must start with an opening tag and end with a closing tag. 
 * No white space or other text must be present before the first 
 * tag/after the last tag, else you get some unexpected results.
 */

$contents = preg_replace( '/^<[^>]+>|<\/[^>]+>$/', '', $markup );
            // ^<[^>]+>     This removes the first tag
            // <\/[^>]+>$   This removes the last closing tag

Examples

This regex works for most HTML markup e.g.

In: '<div class="my-text" id="text" style="color:red">some text</div>'
Out: 'some text' (expected result)

When the first tag contains the ">" character it's going to break everything, e.g.

In: '<div title="Home > Archives">Archive overview</div>'
Out: ' Archives">Archive overview' (unexpected result)

Whitespace or text before the first tag or after the last tag will also break the regex, e.g.

In: '<div>Your name</div>:'
Out: 'Your name</div>:' (unexpected result)

And of course, any tag will be stripped, without any sanity check, e.g.

In: '<h2>Settings</h2><label>Page Title</label>'
Out: 'Settings</h2><label>Page Title' (unexpected result)
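
A slightly more defensive variant might look like the sketch below. It is not part of the original one-liner: the function name strip_outer_tag and the pattern are assumptions made for illustration. It tolerates surrounding whitespace, skips over quoted attribute values (so a ">" inside title="..." no longer breaks the match), requires the closing tag name to match the opening one, and returns the input untouched when the pattern does not fit.

```php
<?php
// Hypothetical helper, shown only as a sketch of a stricter regex.
function strip_outer_tag(string $markup): string
{
    $pattern = '~^\s*<([\w:-]+)(?=[\s/>])'      // opening tag name, captured as \1
             . '(?:"[^"]*"|\'[^\']*\'|[^>"\'])*' // attributes; quoted values may contain ">"
             . '>(.*)'                           // everything between the outer tags
             . '</\1\s*>\s*$~is';                // closing tag must repeat the same name

    if (preg_match($pattern, $markup, $m) === 1) {
        return $m[2];
    }

    // No single outer element recognised: leave the input unchanged.
    return $markup;
}

echo strip_outer_tag('<div title="Home > Archives">Archive overview</div>'), PHP_EOL;
```

This is still only a regex: markup such as '<div>a</div><div>b</div>' starts and ends with the same tag name, so it would come back as 'a</div><div>b'.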

Mark Baker · Accepted Answer · 2015-02-23 17:41:02Z

2

How to take a DOM element, and simulate innerHTML()

$html = '<html><body><div><ul><li>1</li><li>2</li><li>3</li></ul></div></body></html>';

function DOMinnerHTML(DOMNode $element) { 
    $innerHTML = "";
    foreach ($element->childNodes as $child) { 
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }

    return $innerHTML; 
} 

$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('ul') as $child) {
    $html = DOMinnerHTML($child); 
    echo $html, PHP_EOL;
}

without having to resort to regexp
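
If the outer tag name is not known in advance, the same idea can be applied to the first element that libxml places inside <body>. The following is only a sketch under that assumption; the function name innerHtmlOfOuterElement is made up for illustration, and multi-byte input would need extra encoding handling that is left out here.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first top-level element.
function innerHtmlOfOuterElement(string $fragment): string
{
    if (trim($fragment) === '') {
        return $fragment;
    }

    // Collect libxml warnings (e.g. for HTML5 tags such as <aside>) silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($fragment); // libxml wraps the fragment in <html><body>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $fragment;
    }

    foreach ($body->childNodes as $node) {
        if ($node instanceof DOMElement) {
            // Serialise each child of the outer element, tags included.
            $inner = '';
            foreach ($node->childNodes as $child) {
                $inner .= $doc->saveHTML($child);
            }
            return $inner;
        }
    }

    // No element found: leave the input unchanged.
    return $fragment;
}

echo innerHtmlOfOuterElement('<div>Start <br /> <span>test</span> end.</div>'), PHP_EOL;
```

Note that the markup is re-serialised by libxml, so it is normalised along the way: <br /> comes back as <br>, for instance.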

answered Feb 23, 2015 at 17:41

Mark Baker


1 Comment

J Robz Over a year ago

Any way to do this without knowing the parent tagname?

Regular Jo · Accepted Answer · 2015-02-23 17:56:50Z

0

This basic regex will probably do. It does not, however, account for tags whose attributes contain a > character, and will trip on those.

Find: <[^>]*>([\s\S]*)<\/[^>]*>
Replace: $1

It gets more complex if you expect attributes may contain tag brackets.

Find: <(?:[^>]*?(?:(?:"[^"]*?"|'[^']*?')+[^>]*?)|[\s\S]*?)>([\s\S]*)<\/[^>]*>
Replace: $1

Either one should do the trick.
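
In PHP the pattern needs delimiters, and a literal "/" inside a "/"-delimited pattern has to be escaped. A minimal sketch of the first find/replace pair, assuming "~" as the delimiter so the slash in the closing tag needs no escaping (the variable names are only illustrative):

```php
<?php
// "~" is used as the delimiter, so "</" can be written as-is.
$pattern = '~<[^>]*>([\s\S]*)</[^>]*>~';

$str = '<div>Start <br /> <span>test</span> end.</div>';
$result = preg_replace($pattern, '$1', $str);

echo $result, PHP_EOL;
```

The greedy ([\s\S]*) group runs to the last closing tag in the string, which is why the inner <span>...</span> survives.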

edited Feb 23, 2015 at 17:56

answered Feb 23, 2015 at 10:25

Regular Jo


2 Comments

Roman Toasov Over a year ago

Getting error Warning: preg_replace(): Unknown modifier ']' on first regex.

Regular Jo Over a year ago

@RomanToasov Try escaping the forward slash. <[^>]*>([\s\S]*)<\/[^>]*>

Yuseferi · Accepted Answer · 2015-04-11 17:16:13Z

I made a function that removes the HTML tags along with their contents:

Function:

<?php
function strip_tags_content($text, $tags = '', $invert = FALSE) {

  preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
  $tags = array_unique($tags[1]);

  if(is_array($tags) AND count($tags) > 0) {
    if($invert == FALSE) {
      return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    else {
      return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
    }
  }
  elseif($invert == FALSE) {
    return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
  }
  return $text;
}
?>

Sample text: $text = '<b>sample</b> text with <div>tags</div>';

Result for strip_tags($text): sample text with tags

Result for strip_tags_content($text): text with

Result for strip_tags_content($text, '<b>'): <b>sample</b> text with

Result for strip_tags_content($text, '<b>', TRUE): text with <div>tags</div>

I hope someone finds this useful :)
