I'm looking for a regex that will let me find all JavaScript and CSS link tags in a string, so that I can strip certain tags from a DotNetNuke (yeah, I know... ouch!) page in an overridden render event.

I know about the HTML Agility Pack, and I've even read Jeff Atwood's blog entry, but unfortunately I don't have the luxury of a 3rd party library.

Any help would be appreciated.

Edit: I gave this a try to get a JavaScript entry, but it didn't work. Regexes are a dark art to me.

updatedPageSource = Regex.Replace(
    pageSource,
    String.Format("<script type=\"text/javascript\" src=\".*?{0}\"></script>", name),
    "",
    RegexOptions.IgnoreCase);
  • Don't do it! Regex == ouch! Commented Feb 11, 2011 at 13:52
  • "unfortunately I don't have the luxury of a 3rd party library." Care to explain why? Commented Feb 11, 2011 at 13:54
  • @marcog I'm working on a project that has to be finished today. If I introduce a 3rd party solution I have to get it checked etc to see if it's ok. Commented Feb 11, 2011 at 14:02

3 Answers

I have a few comments on this. Your regex is close; the following has been tested and works:

<script type="text/javascript" src=".*myfile.js"></script>

I used the following test inputs

<script type="text/javascript" src="myfile.js"></script>
<script type="text/javascript" src="/test/myfile.js"></script>
<script type="text/javascript" src="/test/Looky/myfile.js"></script>

However, I would advise caution with this approach: it takes time to parse, can be error-prone, and so on.
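As a rough way to reproduce the check above, here is a minimal sketch that runs the pattern from this answer against a piece of markup. The class and method names (LiteralPatternCheck, IsMatch) are made up for illustration and are not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class LiteralPatternCheck
{
    // The pattern from this answer, copied as-is (greedy .* and unescaped dot included).
    private const string Pattern =
        "<script type=\"text/javascript\" src=\".*myfile.js\"></script>";

    // Returns true when the pattern finds a match somewhere in the given markup.
    public static bool IsMatch(string html)
    {
        return Regex.IsMatch(html, Pattern, RegexOptions.IgnoreCase);
    }
}
```

Each of the three test inputs listed above should match. The same tag with its attributes swapped (src first, then type) should not match, which is one example of how this approach can be error-prone.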

DISCLAIMER: Regex + HTML = ouch!

Your problem may be that you are not escaping the regex metacharacters in name (e.g. the dot metacharacter '.', which matches any character). You may want to try this:

updatedPageSource = Regex.Replace(
    pageSource, 
    String.Format("<script\\s+type=\"text/javascript\"\\s+src=\".*?{0}\"\\s*>\\s*</script>", Regex.Escape(name)),
    "",
    RegexOptions.IgnoreCase);

// Just one of the many reasons why you don't mix Regex with HTML:
updatedPageSource = Regex.Replace(
    updatedPageSource, 
    String.Format("<script\\s+src=\".*?{0}\"\\s+type=\"text/javascript\"\\s*>\\s*</script>", Regex.Escape(name)),
    "",
    RegexOptions.IgnoreCase);

I also added optional whitespace here and there.
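To illustrate why the escaping matters, here is a small sketch that builds only the src part of the pattern, with and without Regex.Escape. The helper name MatchesSrc and its escape flag are hypothetical, chosen just for this comparison.

```csharp
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: tests whether html contains a src attribute ending in fileName.
    // When escape is false, fileName is inserted raw, so '.' acts as "any character".
    public static bool MatchesSrc(string html, string fileName, bool escape)
    {
        string namePart = escape ? Regex.Escape(fileName) : fileName;
        string pattern = "src=\".*?" + namePart + "\"";
        return Regex.IsMatch(html, pattern, RegexOptions.IgnoreCase);
    }
}
```

With the raw name "myfile.js", a tag whose src is "myfile_js" is matched too, because the dot matches the underscore. With the escaped name it is not matched, while a real "myfile.js" still is.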

3 Comments

Watch out for the greedy .* in your code. That will match all the way to the last </script> tag it can find. You want .*?.
Thanks, we learn something new every day. For reference: Regex: Greedy vs Lazy
Oh...and your \s needs to be doubly escaped. Either that, or use @"...", but then you'll have to escape the " by doubling them. :)
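The two points raised in these comments can be seen in a short sketch. The class name and the simplified patterns below are assumptions made for illustration; they are not the patterns from the answer.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Greedy: .* runs from the first <script to the LAST </script> on the line.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, "<script.*</script>", "", RegexOptions.IgnoreCase);
    }

    // Lazy: .*? stops at the first </script> it can reach, so each tag is removed on its own.
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, "<script.*?</script>", "", RegexOptions.IgnoreCase);
    }

    // The same pattern text written two ways: regular string (backslashes doubled,
    // quotes backslash-escaped) and verbatim string (quotes doubled, backslashes as-is).
    public const string Escaped = "<script\\s+src=\"a\\.js\"";
    public const string Verbatim = @"<script\s+src=""a\.js""";
}
```

For an input with two script tags and a paragraph between them on one line, StripGreedy should remove the paragraph as well, while StripLazy should leave it in place. Escaped and Verbatim hold the same characters, so either style can be passed to Regex.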
Don't forget to account for things like whitespace, other attributes, different orders of attributes (i.e. src="foo" type="bar" vs type="bar" src="foo"), and " vs ' quoting. Maybe this?

@"<\s*script\b.*?\bsrc=(""|').*?{0}\1.*?(/>|>\s*</\s*script\s*>)"

I went ahead and took out the type attribute. If you have the filename, you know what type of script it is anyway; plus, this accounts for tags where the src attribute comes first, or where the deprecated language attribute is used, or where type is omitted altogether (it's supposed to be there, but it isn't always). Note that I'm using the lazy .*? so that it doesn't match all the way to the last </script> in the page.
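Here is one way the same idea might be wrapped into helpers for both script and CSS link tags. This is only a sketch: the TagStripper class and its method names are hypothetical, and it uses [^>]* rather than .*? inside the tag so that a match cannot run past the tag's closing >. Like the pattern above, it matches any src or href that ends with the given name.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: removes <script> tags whose src value ends with fileName.
    // Accepts either quote style and any attribute order.
    public static string RemoveScript(string html, string fileName)
    {
        string pattern =
            @"<script\b[^>]*?\bsrc\s*=\s*([""'])[^""'>]*?" + Regex.Escape(fileName) +
            @"\1[^>]*?(?:/>|>\s*</script\s*>)";
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }

    // Hypothetical helper: removes <link> tags whose href value ends with fileName.
    public static string RemoveStylesheet(string html, string fileName)
    {
        string pattern =
            @"<link\b[^>]*?\bhref\s*=\s*([""'])[^""'>]*?" + Regex.Escape(fileName) +
            @"\1[^>]*>";
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

A call such as TagStripper.RemoveScript(pageSource, "myfile.js") would then replace the Regex.Replace call from the question. It still will not cope with everything real HTML allows (unquoted attribute values, a > inside an attribute, script tags inside comments), so the earlier warnings about regex and HTML apply here too.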
