
Possible Duplicate:
Remove JavaScript with Regex

How can I remove all content between <script ...> and </script>? If I write:

s = s.replaceAll("<script.+</script>", "");

It removes everything from the first <script to the last </script>, but I want to remove only from each <script up to the first </script> that follows it. Please help.

  • If this is supposed to be a security measure, then it won't work. Commented Nov 23, 2011 at 16:44
  • @OttoAllmendinger - as a security measure, I don't think it can possibly be made foolproof, but it can certainly be made to mangle and invalidate any attempt to bypass a security measure, and it can be made to gracefully and effectively remove properly formatted script. Commented Nov 23, 2011 at 17:20

3 Answers


It's generally a bad idea to use regexes to parse HTML — there are infinitely many corner cases, and it's a lot of effort to catch them all (what if your input is <!-- <script> --> foo <!-- </script> -->?) — but to answer your very specific question: change +, which is a "greedy" quantifier that consumes as much as it can, to +?, which is a "reluctant" quantifier that consumes as little as it can.

See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.
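As a minimal sketch of the difference, the two quantifiers can be compared side by side. It assumes the script elements contain no line breaks, since by default `.` does not match line terminators in Java. The class and method names are made up for illustration.

```java
final class QuantifierDemo {
    // Greedy: .+ runs on to the LAST </script>, swallowing everything in between.
    static String stripGreedy(String s) {
        return s.replaceAll("<script.+</script>", "");
    }

    // Reluctant: .+? stops at the FIRST </script> that follows each <script.
    static String stripReluctant(String s) {
        return s.replaceAll("<script.+?</script>", "");
    }

    public static void main(String[] args) {
        String s = "a<script>x</script>b<script>y</script>c";
        System.out.println(stripGreedy(s));    // prints "ac"
        System.out.println(stripReluctant(s)); // prints "abc"
    }
}
```

The greedy version also deletes the legitimate text "b" sitting between the two script elements, which is the behaviour described in the question.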


1 Comment

Thanks. I tried DOM parsing, but it did not work for me. I am not allowed to use third-party libraries, so I am trying to do this with a regex.

I have suggested this in the past:

<\s*script.*?(/\s*>|<\s*/\s*script[^>]*>)

Use the "single-line" or "dotall" compiler switch, as appropriate to your language or tool.
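In Java, the "dotall" switch is `Pattern.DOTALL` (or the inline flag `(?s)`). The following is a rough sketch of the pattern above compiled with and without that flag. The class name, method names, and sample input are hypothetical.

```java
import java.util.regex.Pattern;

final class DotallScriptStripper {
    private static final String REGEX =
            "<\\s*script.*?(/\\s*>|<\\s*/\\s*script[^>]*>)";

    // With DOTALL, '.' also matches line terminators, so multi-line scripts are removed.
    private static final Pattern WITH_DOTALL = Pattern.compile(REGEX, Pattern.DOTALL);

    // Without DOTALL, '.' stops at a newline, so a multi-line script is left in place.
    private static final Pattern WITHOUT_DOTALL = Pattern.compile(REGEX);

    static String strip(String html) {
        return WITH_DOTALL.matcher(html).replaceAll("");
    }

    static String stripWithoutDotall(String html) {
        return WITHOUT_DOTALL.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "before<script type=\"text/javascript\">\nalert(1);\n</script>after";
        System.out.println(strip(html)); // prints "beforeafter"
        System.out.println(stripWithoutDotall(html).equals(html)); // prints "true"
    }
}
```

Note that this sketch matches case-sensitively, exactly as the pattern is written. Handling `<SCRIPT>` would need an extra flag such as `Pattern.CASE_INSENSITIVE`, which is not part of the original suggestion.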

For more information, see my answer here: https://stackoverflow.com/q/8043367/561690

In response to comments, I have made changes whose only effect should be to make it even harder to get something past it. As for whitespace between < and script: I wouldn't put it past someone to ignore that part of the recommendation (standard?) when building a parser, in the name of making it more flexible, so I'll leave it as part of my answer!

5 Comments

Your \s?s seem very odd to me. Whitespace between the < and script will invalidate that tag; and whitespace between the </script and > is not limited to a single character.
In this you may be absolutely correct. However, the user is brand new and asked a question that is, for all practical purposes, identical to the question I referenced, so there's the question of ROI. I'll make the changes you suggest, but I suspect it won't make much difference. Thanks regardless!
This will convert <s<script></script>cript>alert(1337)</script> to <script>alert(1337)</script>.
@mikesamuel - it will then convert <script>alert(1337)</script> to an empty string, will it not?
@CodeJockey, If the replacement was done in a loop that looped until convergence, yes, but I don't see anything in your answer about doing the replacement in a loop. A single global replacement is insufficient.
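The looping approach described in the last comment could look roughly like the sketch below, which reapplies the pattern from this answer until the string stops changing. The class and method names are hypothetical, and this only illustrates the convergence loop; it does not make regex-based stripping a reliable security measure.

```java
import java.util.regex.Pattern;

final class ConvergentScriptStripper {
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script.*?(/\\s*>|<\\s*/\\s*script[^>]*>)", Pattern.DOTALL);

    // One global replacement, as in the answer.
    static String stripOnce(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Repeat the replacement until a pass changes nothing.
    // Each changing pass shortens the string, so the loop terminates.
    static String stripUntilStable(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = stripOnce(previous);
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String nested = "<s<script></script>cript>alert(1337)</script>";
        System.out.println(stripOnce(nested));                  // prints "<script>alert(1337)</script>"
        System.out.println(stripUntilStable(nested).isEmpty()); // prints "true"
    }
}
```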

The OWASP Java HTML Sanitizer is an OWASP-sponsored HTML sanitizer, written in Java, that takes a string of HTML and whitelists tags and attributes to produce a string of safe HTML.

It has gone through multiple rounds of attack review and fits the same niche as AntiSamy.

Full disclosure: I am a maintainer.
