I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but also miss a lot. Parsing the page with the MSHTML DOM parser causes the JavaScript to actually run, which is what I am trying to avoid by using regex.
"<script.*/>"
"<script[^>]*>.*</script>"
"<script.*?>[\\s\\S]*?</.*?script>"
Does anyone know what I am missing that causes these three regular expressions to skip blocks of JavaScript?
An example of what I am trying to remove:
<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
<script type="text/javascript">
<!--
var Time=new Application('Time')
//-->
</script>
<script type="text/javascript">
if(window['com.actions']) {
window['com.actions'].approvalStatement = "",
window['com.actions'].hasApprovalStatement = false
}
</script>
Your expressions rely on a greedy .*, so a match will run from the first <script> on the page to the last </script>, possibly including content between script tags that you didn't mean to remove.

Parsing HTML with regex is usually discouraged because ordinary elements can nest arbitrarily (for example <span><b><i><u>hello <span class="mundo">world</span></u></i></b></span>), but script tags have basically no nesting, so that objection is nowhere near as pertinent (comment or CDATA markers are often used inside script tags, but these are not a challenge to ignore). Removing or stripping HTML is also slightly different from parsing it, as the expressions can be significantly less complex.
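
As a rough illustration of the point about greedy matching, here is a minimal sketch of a non-greedy version in C#. The class and method names (ScriptStripper, RemoveScripts) are made up for this example and are not from the original post. It assumes that a self-closing <script ... /> should be removed on its own, as in the question's sample, and that no attribute value contains a literal > character. RegexOptions.Singleline lets . match newlines, so multi-line script bodies are covered, and .*? stops at the nearest closing tag instead of the last one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Hypothetical helper for illustration only.
    // Alternative 1: a self-closing tag such as <script src="..." />.
    // Alternative 2: an opening tag, a lazily matched body, and the nearest </script>.
    // Assumption: attribute values never contain a literal '>'.
    private static readonly Regex ScriptPattern = new Regex(
        @"<script\b[^>]*/>|<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the input with every matched script block removed.
    public static string RemoveScripts(string html)
    {
        return ScriptPattern.Replace(html, string.Empty);
    }

    public static void Main()
    {
        string html =
            "<p>before</p>\n" +
            "<script type=\"text/javascript\">\n<!--\nvar Time=new Application('Time')\n//-->\n</script>\n" +
            "<p>middle</p>" +
            "<script src=\"do_files/page.js\" type=\"text/javascript\" />" +
            "<p>after</p>";

        // The three paragraphs should survive; both script tags should not.
        Console.WriteLine(RemoveScripts(html));
    }
}
```

This only strips script elements. Inline event handlers such as onclick attributes and javascript: URLs are not touched, so it should not be treated as a sanitizer for untrusted input.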