1

I use the regex below to replace the text between two words (here, to strip <script> elements). It works, except that it skips some of the matches. An example is pasted below.

var EditedHtml = Regex.Replace(htmlText, @"<script(.*?)</script>", ""); 

htmlText:

 <head>
   <script src=" https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js" type="text/javascript"></script>
   <script src=" https://ajax.googleapis.com/ajax/libs/jqueryui/1.8.18/jquery-ui.min.js" type="text/javascript"></script>
   <script src="/AspellWeb/v2/js/dragiframe.js" type="text/javascript"></script>
   <script type="text/javascript">
     var applicationName = '/';
     FullPath = (applicationName.length > 1) ? 'http://localhost:65355' + applicationName : 'http://localhost:65355';
     //FullPath = 'http://localhost:65355';
     GetPath = function (url) {
     return FullPath + url;
   }
   </script>

   <script type="text/javascript" src="../../Scripts/stats.js?"></script>
</head>

<body>
  .......
  <script type="text/javascript">
    function loadAndInit() {

    $(".dvloading").hide();
    if ($.browser.mozilla) {
      if (location.pathname == "/Stats/Reports") {            // This is for local env.
        $("#prntCss").attr("href", "../../../Content/SitePrint_FF.css");
      }
      else {                                                  // This is for DEV/QA/STAGE/PROD env. 
        $("#prntCss").attr("href", "../../Content/SitePrint_FF.css");
      }
    }

  }
  </script>
</body>

EditedHtml:

<head>
  <script type="text/javascript">
    var applicationName = '/';
    FullPath = (applicationName.length > 1) ? 'http://localhost:65355' + applicationName : 'http://localhost:65355';
    //FullPath = 'http://localhost:65355';
    GetPath = function (url) {
      return FullPath + url;
    }
  </script>
</head>

<body>
  .......
  <script type="text/javascript">
    function loadAndInit() {

      $(".dvloading").hide();
      if ($.browser.mozilla) {
        if (location.pathname == "/Stats/Reports") {            // This is for local env.
          $("#prntCss").attr("href", "../../../Content/SitePrint_FF.css");
        }
        else {                                                  // This is for DEV/QA/STAGE/PROD env. 
          $("#prntCss").attr("href", "../../Content/SitePrint_FF.css");
        }
      }

    }
  </script>
</body>
2
  • You need to use RegexOptions.Singleline to get . (dot) to match newlines. Commented Apr 16, 2013 at 22:30
  • My first guess is that the dot doesn't match newlines. Try [\s\S]*? instead (a dot inside a character class is a literal dot, so [.\r\n] would not help). Commented Apr 16, 2013 at 22:33

4 Answers

4

Why do you use a regex to parse HTML? See this.

It would be much easier to use a real HTML parser like HtmlAgilityPack:

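using System.IO;
using System.Linq;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(filename); // for an HTML string, use doc.LoadHtml(htmlString) instead

// Materialize the matches with ToList() first, so nodes are not removed while enumerating.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script").ToList()
    .ForEach(s => s.Remove());

// Serialize the cleaned document back to a string.
StringWriter wr = new StringWriter();
doc.Save(wr);
var newhtml = wr.ToString();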

2 Comments

doc.Load throws an "Illegal characters in path" exception. It should be doc.LoadHtml().
@BumbleBee doc.Load requires a filename. If you want to load a string, you should use doc.LoadHtml, as I noted in the comment in the answer's code.
2

Try it in single line mode:

var EditedHtml = Regex.Replace(
    htmlText, @"<script(.*?)</script>", "", 
    RegexOptions.Singleline); 

Documentation quote:

Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
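
To see what the option changes, here is a minimal sketch. The class name SinglelineDemo, the method names, and the sample input are made up for illustration; only the pattern and RegexOptions.Singleline come from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in the question.
    private const string Pattern = @"<script(.*?)</script>";

    // Default mode: '.' does not match '\n', so multi-line script blocks are skipped.
    public static string StripDefault(string html) =>
        Regex.Replace(html, Pattern, "");

    // Singleline mode: '.' also matches '\n', so multi-line blocks are removed too.
    public static string StripSingleline(string html) =>
        Regex.Replace(html, Pattern, "", RegexOptions.Singleline);

    public static void Main()
    {
        // Hypothetical input: one single-line and one multi-line script element.
        string html = "<script src=\"a.js\"></script><p>x</p><script>\nvar a = 1;\n</script>";

        Console.WriteLine(StripDefault(html));    // the multi-line block is still there
        Console.WriteLine(StripSingleline(html)); // only <p>x</p> remains
    }
}
```

This mirrors the behaviour in the question: the one-line script tags disappear either way, while the blocks that span several lines are only removed in single-line mode.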

2 Comments

Why do people insist on parsing HTML with regex? Take a simple case: <html><!-- <script --> test<!-- </script> --></html>. My browser shows "test" for this HTML, but your regex removes "test" from it.
My regex? This is OP's regex. I'm not passing judgement on OP's choice of tool for his job, I'm just correcting his code. I agree that a proper parser would be better for robustness, but a quick and dirty regex is fine sometimes. Maybe the html follows a known format, maybe it's a one-off script.
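
The edge case raised in the first comment can be reproduced with a short sketch. The class name CommentEdgeCase and the Strip helper are hypothetical; the input string is the one from the comment and the pattern is the one from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentEdgeCase
{
    // The single-line-mode regex from the answer.
    public static string Strip(string html) =>
        Regex.Replace(html, @"<script(.*?)</script>", "", RegexOptions.Singleline);

    public static void Main()
    {
        // "<script" and "</script>" appear only inside HTML comments here,
        // so a browser renders the visible text "test".
        string html = "<html><!-- <script --> test<!-- </script> --></html>";

        // The regex knows nothing about comments: it removes everything from
        // "<script" through "</script>", including the visible text.
        Console.WriteLine(Strip(html));
    }
}
```

Whether this matters depends on the input, which is the point of the reply: for HTML in a known format the regex may be good enough, while arbitrary HTML is safer with a parser.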
2

Try

var EditedHtml = Regex.Replace(
    htmlText, @"<script(.*?)</script>", "", RegexOptions.Singleline
); 

Use singleline mode so the . matches any character including newlines.

1 Comment

Why do people insist on parsing HTML with regex? Take a simple case: <html><!-- <script --> test<!-- </script> --></html>. My browser shows "test" for this HTML, but your regex removes "test" from it.
0

Try this:

// (.|\r\n)*  : matches any character or a newline, zero or more times
// (.|\r\n)*? : the same, but as few times as possible => removes the <script> tags and their content while keeping the rest of the HTML
var EditedHtml = Regex.Replace(htmlText, @"<script (.|\r\n)*?</script>", ""); 

Hope it helps

References: http://msdn.microsoft.com/en-us/library/az24scfc.aspx

4 Comments

In .NET regexes, . matches every character except linefeed (\n), so you would only have to use (.|\n)*?. But it's easier and more efficient to use .*? and specify Singleline mode, as others have suggested.
Thanks for your feedback, I must admit that it escapes me why using Singleline mode is more efficient, could you please clarify this point?
First, you have to enclose it in a group, so you've got the extra overhead of entering and leaving the group every time you consume a character. And you're using a capturing group, which adds even more overhead. Second, alternation itself tends to be less efficient than an equivalent character class. (.|\n) is so simple the regex engine can probably optimize it away, but a more complicated alternation can easily bring the engine to its knees, as this answer explains.
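
The two approaches compared in these comments can be put side by side in a small sketch. The class and method names are made up, the sample input is hypothetical, and the alternation is written with a non-capturing group, (?:...), to avoid the capture overhead mentioned above. The sketch only checks that both variants give the same result; it does not measure performance.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternVariants
{
    // Alternation: '.' covers every character except '\n', and '\n' covers the rest.
    public static string StripWithAlternation(string html) =>
        Regex.Replace(html, @"<script(?:.|\n)*?</script>", "");

    // Lazy dot in Singleline mode: '.' matches every character, no group needed.
    public static string StripWithSingleline(string html) =>
        Regex.Replace(html, @"<script.*?</script>", "", RegexOptions.Singleline);

    public static void Main()
    {
        // Hypothetical input, once with LF and once with CRLF line endings.
        string lf = "<p>a</p><script type=\"text/javascript\">\nvar x = 1;\n</script><p>b</p>";
        string crlf = lf.Replace("\n", "\r\n");

        foreach (string html in new[] { lf, crlf })
        {
            Console.WriteLine(StripWithAlternation(html));
            Console.WriteLine(StripWithSingleline(html));
        }
    }
}
```

Both variants handle LF and CRLF input alike, because '.' already matches '\r' in .NET; the Singleline version is simply the shorter pattern.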
'Fraid I can't help you with that. You might try a .NET-specific discussion forum; I'm sure there are many of them out there. But this is definitely not the place for questions like that.
