Remove JavaScript with Regex

Question

I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but miss a lot too. Parsing the page with the MSHTML DOM parser causes the JavaScript to actually run, which is what I am trying to avoid by using regex.

    "<script.*/>"

    "<script[^>]*>.*</script>"

    "<script.*?>[\\s\\S]*?</.*?script>"

Does anyone know what I am missing that is causing these three regex expressions to miss blocks of JavaScript?

An example of what I am trying to remove:

    <script src="do_files/page.js" type="text/javascript"></script>
    <script src="do_files/page.js" type="text/javascript" />
    <script type="text/javascript">
    <!--
        var Time=new Application('Time')
    //-->
    </script>
    <script type="text/javascript">
        if(window['com.actions']) {
            window['com.actions'].approvalStatement =  "",
            window['com.actions'].hasApprovalStatement = false
        }
    </script>

Use an HTML parser (like Nokogiri) and modify the DOM; do not use a regex on the raw HTML. Are you trying to do this on the web browser client or on the server? If the server, what programming language?
– Phrogz, Commented Nov 7, 2011 at 19:20
If anything, it looks like your regexes will match more than you want. Your #2 is doing a greedy .*, so it will match everything from the first <script> on the page to the last </script>, possibly including content between script tags that you didn't mean to remove.
– Joe White, Commented Nov 7, 2011 at 19:29
Language is C#. Using the mshtml parser actually runs the JavaScript, which is what I am trying to avoid by removing it in the first place.
– tcables, Commented Nov 7, 2011 at 19:32
Regex is not particularly good for PARSING HTML, but that is because HTML allows nesting constructs (like <span><b><i><u>hello <span class="mundo">world</span></u></i></b></span>). Script tags have basically no nesting, so the problem is nowhere near as pertinent (comments or CDATA sections are often used inside script tags, but these are not a challenge to ignore). REMOVING or STRIPPING HTML is slightly different, as the expressions can be significantly less complex.
– Code Jockey, Commented Nov 7, 2011 at 21:39
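Joe White's point about the greedy .* can be checked with a small C# sketch. The class name and the sample strings are made up for illustration; only the first pattern comes from the question. The last method also shows one likely reason blocks get missed: without RegexOptions.Singleline, '.' does not match a newline.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the sample HTML is invented for illustration.
public static class GreedyVsLazyDemo
{
    public const string Sample =
        "<script>a();</script><p>keep me</p><script>b();</script>";

    // Pattern #2 from the question: the greedy .* runs on to the LAST </script>.
    public static string Greedy()
    {
        return Regex.Replace(Sample, "<script[^>]*>.*</script>", "");
    }

    // Same pattern with a lazy .*? stops at the FIRST </script> each time.
    public static string Lazy()
    {
        return Regex.Replace(Sample, "<script[^>]*>.*?</script>", "");
    }

    // No Singleline option here, so '.' cannot cross the newlines
    // and the multi-line script block is not matched at all.
    public static string MultiLine()
    {
        return Regex.Replace("<script>\na();\n</script>", "<script[^>]*>.*?</script>", "");
    }
}
```

Greedy() removes the paragraph along with both scripts, Lazy() keeps the paragraph, and MultiLine() returns its input unchanged.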

Code Jockey · Accepted Answer · 2011-11-07 22:13:50Z

I assume you are simply trying to sanitize the input of JavaScript. Frankly, I'm worried that this is too simple a solution, because it seems so incredibly simple. See below for my reasoning, after the expression (in a C# string):

@"(?s)<script.*?(/>|</script>)"

That's it - I hope! (It certainly works for your examples!)
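For what it's worth, here is a minimal sketch of how that expression might be wired up in C#. The class and method names are hypothetical, and RegexOptions.IgnoreCase is an addition that is not part of the expression above (it is there so that <SCRIPT> is caught as well).

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the expression from this answer.
public static class SimpleScriptRemover
{
    // (?s) lets '.' match newlines; the lazy .*? stops at the first
    // "/>" or "</script>" it meets. IgnoreCase is an assumption added here.
    private static readonly Regex ScriptTag =
        new Regex(@"(?s)<script.*?(/>|</script>)", RegexOptions.IgnoreCase);

    public static string Remove(string html)
    {
        return ScriptTag.Replace(html, string.Empty);
    }
}
```

Because the match ends at the first "/>", a script that writes something like <br /> is cut short and its tail is left in the output as plain text. That tail is no longer inside a script element, but it shows that the output is not always clean.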

My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags. It's not so much the nesting of DIFFERENT tags as the nesting of IDENTICALLY NAMED tags.

For example,

<b> bold <i> AND italic </i></b>

...is not so bad, but

<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>

would be much harder to parse, because the ending tags are IDENTICAL.

However, since it is invalid to nest script tags, the regex treats the next instance of /> or </script> as the end of the script block. (Whether /> really closes the tag depends on the parser: it does in XHTML, while an HTML parser treats <script ... /> as an opening tag.)

There's always the possibility of HTML comments or CDATA sections inside the script tag, but those should be fine if they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so you never know.

To handle a little extra possible whitespace, you could use:

@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"

Please let me know if you can figure out a way to break it that will let through VALID HTML code with runnable JavaScript. (I know there are a few ways to get some stuff through, but whatever gets through should be broken in one of many different ways and should not be runnable JavaScript code.)

Of course, this should handle complete removal of any valid script blocks, and valid HTML in should be valid HTML out (minus script blocks).

Alex Turpin · Accepted Answer · 2011-11-07 19:21:04Z

It is generally agreed upon that trying to parse HTML with regex is a bad idea and will yield bad results. Instead, you should use a DOM parser. jQuery wraps nicely around the browser's DOM and would allow you to very easily remove all <script> tags.

2 Comments

Joe White Over a year ago

Heh. I like the irony of using jQuery to remove JavaScript.

Alex Turpin Over a year ago

The HTML Agility Pack seems to be the standard C# solution for this.

Olivier Rassi · Accepted Answer · 2016-07-06 09:18:14Z

OK, I have faced a similar case, where I needed to clean "rich text" (text with HTML formatting) of any possible JavaScript.

There are several ways to add JavaScript to HTML:

by using the <script> tag, with JavaScript inside it or by loading a JavaScript file using the "src" attribute. ex: <script>maliciousCode();</script>
by using an event on an HTML element, such as "onload" or "onmouseover". ex: <img src="a.jpg" onload="maliciousCode()">
by creating a hyperlink that calls JavaScript code. ex: <a href="javascript:maliciousCode()">...

This is all I can think of for now.

So the submitted HTML code needs to be cleaned of these 3 cases. A simple solution would be to look for these patterns using regex and replace them with "", or do whatever else you want.

This is some simple code to do this:

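using System.Text.RegularExpressions;

public static string CleanHTMLFromScript(string str)
{
    // 1. Drop opening <script ...> tags (the script body and </script> stay behind as text).
    Regex re = new Regex("<script[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 2. Drop any whole tag whose text contains on<letters>=, e.g. onload= or onmouseover=.
    re = new Regex("<[a-z][^>]*on[a-z]+=\"?[^\"]*\"?[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 3. Drop opening <a> tags whose href starts with javascript:.
    re = new Regex("<a\\s+href\\s*=\\s*\"?\\s*javascript:[^\"]*\"[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    return str;
}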

This code allows for some of the spaces and quotes that may or may not be present. It seems to work fine; it is not perfect, but it does the trick. Any improvements are welcome.
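One possible refinement, sketched here under loud assumptions, is to strip only the offending attribute and keep the rest of the tag, instead of deleting the whole element. The class name, both patterns, and the "#" placeholder are invented for illustration. This is still a blacklist, so it can be bypassed (for example with entity-encoded "javascript:" or a "/" used instead of whitespace before the attribute).

```csharp
using System.Text.RegularExpressions;

// Hypothetical sketch: remove event-handler attributes and javascript: hrefs
// while leaving the surrounding tag in place. Not a complete sanitizer.
public static class AttributeScrubber
{
    // Whitespace, then onxxx = "..." or '...' or an unquoted value.
    private static readonly Regex EventAttr = new Regex(
        @"\s+on[a-z]+\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]+)",
        RegexOptions.IgnoreCase);

    // href = "javascript:..." or 'javascript:...'; group 1 keeps the "href=" part.
    private static readonly Regex JsHref = new Regex(
        @"(href\s*=\s*)(?:""\s*javascript:[^""]*""|'\s*javascript:[^']*')",
        RegexOptions.IgnoreCase);

    public static string Scrub(string html)
    {
        html = EventAttr.Replace(html, "");
        // Replace the script URL with a harmless placeholder target.
        return JsHref.Replace(html, "$1\"#\"");
    }
}
```

Note that EventAttr is not tag-aware, so it would also remove matching text that appears outside of tags.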

Community · Accepted Answer · 2017-05-23 12:22:28Z

Creating your own HTML parser or script detector is a particularly bad idea if this is being done to prevent cross-site scripting. Doing this by hand is a Very Bad Idea, because there are any number of corner cases and tricks that can be used to defeat such an attempt. This is termed "black listing", as it attempts to remove the unsafe items from HTML, and it's pretty much doomed to failure.

Much safer to use a white list processor (such as AntiSamy), which only allows approved items through by automatically escaping everything else.
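To make the white-list idea concrete, here is a deliberately tiny C# sketch. It is not AntiSamy and not a substitute for it. The class name and the list of allowed tags are assumptions: everything is HTML-encoded first, and then only a handful of attribute-free tags are restored.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical minimal white-list filter: escape everything, then
// re-enable a few known-safe tags that carry no attributes at all.
public static class TinyWhitelist
{
    // Assumed allow-list; matches an encoded opening or closing tag.
    private static readonly Regex AllowedTag = new Regex(
        @"&lt;(/?)(b|i|em|strong|p)&gt;", RegexOptions.IgnoreCase);

    public static string Sanitize(string html)
    {
        // WebUtility.HtmlEncode turns <, >, & and quotes into entities.
        string encoded = WebUtility.HtmlEncode(html);
        // Restore only the exact allowed tags; everything else stays escaped.
        return AllowedTag.Replace(encoded, "<$1$2>");
    }
}
```

Because anything that is not an exact allowed tag stays escaped, <script> blocks and tags with attributes such as <b onclick="..."> come out as visible text rather than markup.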

Of course, if this isn't what you're doing then you should probably edit your question to give some more context...

Edit:

Now that we know you're using C#, try the HTMLAgilityPack as suggested here.

answered Nov 7, 2011 at 19:32 by Scott A; edited May 23, 2017 at 12:22 by Community Bot

1 Comment

tcables Over a year ago

I have had troubles with bugs in the agility pack in the past so I tend to stay away from it...but thanks for the suggestion.

Michael Stum · Accepted Answer · 2011-11-07 20:34:33Z

Which language are you using? As a general statement, regular expressions are not suitable for parsing HTML.

If you are on the .NET platform, the HTML Agility Pack offers a much better parser.

user557597 · Accepted Answer · 2011-11-07 20:44:06Z

You should use a real HTML parser for the job. That being said, for simple stripping
of script blocks you could use a rudimentary regex like the one below.

The idea is that you need a callback to determine whether capture group 1 matched.
If it did, the callback passes things that hide HTML (like comments) back through
unchanged; otherwise the match was a script block, which is replaced with an empty string.

This won't substitute for an HTML processor, though. Good luck!

Search Regex: (modifiers - expanded, global, include newlines in dot, callback func)

  (?:
     <script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*> .*? </script\s*>
   | </?script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*/?>
  )
|
  (   # Capture group 1
    <!(?:DOCTYPE.*?|--.*?--)>  # things that hide html, add more constructs here ...
  )

Replacement func pseudo code:

string callback () {
  if capture buffer 1 matched
    return capture buffer 1
  else
    return ''
}
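Since the question is about C#, the callback idea could be sketched with a MatchEvaluator lambda. The pattern below is a simplified stand-in for the regex above, not a transcription of it (the attribute handling is reduced to [^>]*), and the class name is hypothetical. The options used correspond to the "expanded" and "include newlines in dot" modifiers.

```csharp
using System.Text.RegularExpressions;

// Hypothetical sketch of the callback approach with a simplified pattern.
public static class ScriptBlockStripper
{
    private static readonly Regex Pattern = new Regex(@"
          <script\b [^>]* > .*? </script\s*>     # a full script block
        | </?script\b [^>]* >                    # a stray or self-closed script tag
        | ( <!--.*?--> | <!DOCTYPE [^>]* > )     # group 1: constructs to pass through
        ",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    public static string Strip(string html)
    {
        // If group 1 matched, hand the text back unchanged; otherwise drop the match.
        return Pattern.Replace(html, m => m.Groups[1].Success ? m.Groups[1].Value : "");
    }
}
```

A comment is consumed as a whole before the script alternatives get a chance to match inside it, which is why script text inside a comment is left alone.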
