.NET Remove/Strip JavaScript and CSS code blocks from HTML page

Question

I have an HTML string that contains JavaScript and CSS code blocks:

<script type="text/javascript">

  alert('hello world');

</script>

<style type="text/css">
  A:link {text-decoration: none}
  A:visited {text-decoration: none}
  A:active {text-decoration: none}
  A:hover {text-decoration: underline; color: red;}
</style>

How can I strip those blocks? Are there regular expressions that can be used to remove them?

carla · Accepted Answer · 2017-11-23 15:04:17Z

20

The quick 'n' dirty method would be a regex like this:

var regex = new Regex(
   "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)", 
   RegexOptions.Singleline | RegexOptions.IgnoreCase
);

string output = regex.Replace(input, "");

The better* (but possibly slower) option would be to use HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlInput);

// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//script|//style");

if (nodes != null)
{
    foreach (var node in nodes)
        node.ParentNode.RemoveChild(node);
}

string htmlOutput = doc.DocumentNode.OuterHtml;

*) For a discussion about why it's better, see this thread.

edited Nov 23, 2017 at 15:04

carla

answered Jun 17, 2011 at 9:20

Elian Ebbing

4 Comments

GvS Over a year ago

Do you know Tony The Pony?

Elian Ebbing Over a year ago

@GvS: I know about the problems that can arise when you use regular expressions to process HTML. So for most cases I would strongly recommend an HTML parser like HtmlAgilityPack, but it depends on the situation. If it is a one-time batch to remove script and style blocks, and I know that the input is valid HTML, then my regex above can be sufficient, especially since <script> tags and <style> tags can't have nested tags.

Elian Ebbing Over a year ago

@GvS: I added an example that uses HtmlAgilityPack.

Vinnie Amir Over a year ago

Be careful of inline scripts as well, e.g. <body onload="doSomething()">. You need far more sophisticated tools to get rid of those.

Rajeev · Accepted Answer · 2011-06-17 10:47:22Z

2

Use HtmlAgilityPack for better results, or try this function:

public string RemoveScriptAndStyle(string HTML)
{
    string Pat = "<(script|style)\\b[^>]*?>.*?</\\1>";
    return Regex.Replace(HTML, Pat, "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
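
One gap worth noting in the pattern above: HTML allows whitespace before the closing `>` of an end tag (for example `</script >`), and `</\1>` will not match that. The following is a minimal sketch of a slightly more tolerant variant, not a substitute for a real parser. The class name `HtmlCleaner` and the added `\s*` are my assumptions and are not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Same idea as the pattern above: \1 ties the closing tag to the opening one.
    // The added \s* (an assumption) also accepts end tags such as "</script >".
    private const string Pattern = @"<(script|style)\b[^>]*?>.*?</\1\s*>";

    public static string RemoveScriptAndStyle(string html)
    {
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

For example, `HtmlCleaner.RemoveScriptAndStyle("<p>a</p><SCRIPT>x()</script ><p>b</p>")` returns `<p>a</p><p>b</p>`.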

answered Jun 17, 2011 at 10:47

Rajeev

cusimar9 · Accepted Answer · 2011-06-17 08:38:06Z

1

Just look for an opening <script tag, and then remove everything between it and the closing /script> tag.

Likewise for the style. See Google for string manipulation tips.
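
The search-and-cut idea described above could look roughly like this. It is only a sketch under stated assumptions: `BlockStripper` and `RemoveBlocks` are hypothetical names, matching is case-insensitive, and an unterminated block is left in place rather than removed.

```csharp
using System;
using System.Text;

public static class BlockStripper
{
    // Removes every <tagName ...> ... </tagName> block using plain string search.
    public static string RemoveBlocks(string html, string tagName)
    {
        string open = "<" + tagName;
        string close = "</" + tagName + ">";
        var result = new StringBuilder(html.Length);
        int pos = 0;     // start of the text that has not been copied yet
        int search = 0;  // where to look for the next opening tag

        while (true)
        {
            int start = html.IndexOf(open, search, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;

            int after = start + open.Length;
            // Skip longer tag names such as <scripture> when looking for <script.
            if (after < html.Length && char.IsLetterOrDigit(html[after]))
            {
                search = after;
                continue;
            }

            int end = html.IndexOf(close, after, StringComparison.OrdinalIgnoreCase);
            if (end < 0) break; // unterminated block: leave the rest untouched

            result.Append(html, pos, start - pos); // keep the text before the block
            pos = end + close.Length;              // resume after the closing tag
            search = pos;
        }

        result.Append(html, pos, html.Length - pos); // keep the remaining tail
        return result.ToString();
    }
}
```

Calling `RemoveBlocks(html, "script")` and then `RemoveBlocks(result, "style")` handles both kinds of block. Like any purely textual approach, it stops at the first closing tag it finds, wherever that appears.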

answered Jun 17, 2011 at 8:38

cusimar9

2 Comments

kͩeͣmͮpͥ ͩ Over a year ago

doesn't work if your code has document.write("</script>") in it

Bamboo Over a year ago

Is this sufficient from a security standpoint (i.e. to prevent JavaScript from executing)?
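
The `document.write("</script>")` caveat raised in the comments can be reproduced with a short sketch. The class name `ScriptLiteralDemo` is hypothetical, and the pattern is a simplified lazy match used only for illustration; any approach that stops at the first `</script>` it sees behaves the same way.

```csharp
using System.Text.RegularExpressions;

public static class ScriptLiteralDemo
{
    // Lazily matches from "<script" up to the FIRST "</script>" in the input,
    // even when that "</script>" sits inside a JavaScript string literal.
    public static string Strip(string html)
    {
        return Regex.Replace(html, @"<script\b[^>]*>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Given the input `<script>document.write("</script>");</script><p>text</p>`, `Strip` returns `");</script><p>text</p>`, so part of the script leaks into the output. Browsers' HTML parsers also end a script element at the first `</script>`, which is why page authors commonly write `<\/script>` inside string literals.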

Suhan · Accepted Answer · 2013-07-03 09:05:06Z

I reinvented the wheel and wrote my own stripper. It may not be as correct as HtmlAgilityPack, but it is about 5-6 times faster on a page of around 400 KB. It can also convert characters to lowercase, and it removes digits (it was written for a tokenizer).

    private static readonly List<byte[]> SPECIAL_TAGS = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("script"),
        Encoding.ASCII.GetBytes("style"),
        Encoding.ASCII.GetBytes("noscript")
    };

    private static readonly List<byte[]> SPECIAL_TAGS_CLOSE = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("/script"),
        Encoding.ASCII.GetBytes("/style"),
        Encoding.ASCII.GetBytes("/noscript")
    };

    public static string StripTagsCharArray(string source, bool toLowerCase)
    {
        var array = new char[source.Length];
        var arrayIndex = 0;
        var inside = false;
        var haveSpecialTags = false;
        var compareIndex = -1;
        var singleQouteMode = false;
        var doubleQouteMode = false;
        var matchMemory = SetDefaultMemory(SPECIAL_TAGS);
        for (int i = 0; i < source.Length; i++)
        {
            var let = source[i];
            if (inside && !singleQouteMode && !doubleQouteMode)
            {
                compareIndex++;
                if (haveSpecialTags)
                {
                    var endTag = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS_CLOSE, ref matchMemory);
                    if (endTag) haveSpecialTags = false;
                }
                if (!haveSpecialTags)
                {
                    haveSpecialTags = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS, ref matchMemory);
                }
            }
            if (haveSpecialTags && let == '"')
            {
                doubleQouteMode = !doubleQouteMode;
            }
            if (haveSpecialTags && let == '\'')
            {
                singleQouteMode = !singleQouteMode;
            }
            if (let == '<')
            {
                matchMemory = SetDefaultMemory(SPECIAL_TAGS);
                compareIndex = -1;
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (inside) continue;
            if (char.IsDigit(let)) continue; 
            if (haveSpecialTags) continue;
            array[arrayIndex] = toLowerCase ? Char.ToLowerInvariant(let) : let;
            arrayIndex++;
        }
        return new string(array, 0, arrayIndex);
    }

    private static bool[] SetDefaultMemory(List<byte[]> specialTags)
    {
        var memory = new bool[specialTags.Count];
        for (int i = 0; i < memory.Length; i++)
        {
            memory[i] = true;
        }
        return memory;
    }

RadicalGratitude · Accepted Answer · 2020-07-20 17:54:57Z

Similar to Elian Ebbing's and Rajeev's answers, I opted for the more stable solution of using an HTML library rather than regular expressions. Instead of HtmlAgilityPack I used AngleSharp, which gave me jQuery-like selectors, in .NET Core 3:

//using AngleSharp;
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(sourceHtml)); // generate HTML DOM from source html string
var elems = document.QuerySelectorAll("script, style"); // get script and style elements
foreach(var elem in elems)
{
    var parent = elem.Parent;
    parent.RemoveChild(elem); // remove element from DOM
}
var resultHtml = document.DocumentElement.OuterHtml; // HTML result as a string
