I'm building a forum-like web app. Users can submit rich HTML text to the server, such as p tags, div tags, etc. To preserve the formatting, the server writes these tags back to the users' browsers directly (without HTML encoding). So I must check for potentially dangerous scripts to avoid XSS. Any JavaScript code is considered dangerous and is not allowed. How can I detect it, or is there a better solution?

dangerous example 1:

<script>alert('1')</script>

dangerous example 2:

<script src="..."></script>

dangerous example 3:

<a href="javascript:dangerousFunction();">click me</a>
4 Comments
  • stackoverflow.com/a/21729561/7106750 This might help you @guogangj Commented Dec 29, 2016 at 3:01
  • Maybe check out [this](stackoverflow.com/questions/15458876/… ) and try to do the same thing in JS. Commented Dec 29, 2016 at 3:08
  • Only allow a certain subset of tags, e.g., <p>, <div>, <strong>, <em>, etc.; remove all other tags. Commented Dec 29, 2016 at 3:23
  • What you are trying to do is called "sanitizing". Search for that term and you will find lots of libraries that you can either use as is or borrow from. Commented Dec 29, 2016 at 3:24

2 Answers


Use an HTML Parser

Your requirements are straightforward:

  • You must disallow all <script> tags, but keep certain rich HTML tags.
  • You must be able to neutralise inline JavaScript in links, i.e. either escape it or strip the unsafe attributes altogether.

The correct way to handle both is to use a modern, standards-compliant HTML parser that analyses the structure of the submitted HTML, identifies each tag, and exposes the raw values of its attributes. You can then keep only the tags and attributes you trust. This is how sanitisation, which one of the comments mentions, is done.
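
As a rough sketch of that idea (not a replacement for a maintained library), the following uses `html.parser` from Python's standard library. Python is an assumption, since the question does not name a server language. The allowlists and the names `AllowlistSanitizer`, `is_safe_url` and `sanitize` are hypothetical and chosen only for illustration.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for illustration; adjust them to what the forum needs.
ALLOWED_TAGS = {"p", "div", "strong", "em", "a", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
DROP_WITH_CONTENT = {"script", "style"}  # drop these tags and their text
VOID_TAGS = {"br"}                       # tags that take no closing tag

_SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_safe_url(url):
    """Allow relative URLs and URLs whose scheme is on the allowlist."""
    # Drop whitespace and control characters first, because browsers
    # ignore several of them when working out the scheme.
    cleaned = "".join(ch for ch in url if ord(ch) > 0x20)
    match = _SCHEME_RE.match(cleaned)
    return match is None or match.group(1).lower() in ALLOWED_SCHEMES


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything unlisted
            if name == "href" and not is_safe_url(value):
                continue  # drops javascript: links
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is re-escaped so it can never turn back into markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

With these assumptions, `sanitize` drops the first two examples from the question entirely, and the third keeps its link text but loses the `javascript:` href. Tags and attributes that are not on the allowlist are removed rather than detected one by one, which is the point of the approach.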

There are a number of pre-existing HTML parsers designed to handle XSS-unsafe input. The npm library js-xss, for example, appears to be able to do exactly what you want.

You can even run this server-side as a command line utility.

Similar libraries exist for most languages, so search your preferred language's package repository thoroughly. Alternatively, you can launch js-xss as a subprocess and collect its output from the command line.

Avoid naively parsing HTML with regular expressions. While it is true that most HTML parsers use regular expressions under the hood, they do so in a limited fashion, on strictly well-defined grammars, after lexing the input correctly.



Use this regex

<script([^'"]|"(\\.|[^"\\])*"|'(\\.|[^'\\])*')*?<\/script>

to detect <script> tags.

However, I suggest showing all user-submitted HTML in an iframe in sandbox mode. That prevents any JavaScript in it from doing anything harmful.

http://www.w3schools.com/tags/att_iframe_sandbox.asp
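
A minimal sketch of the sandbox idea, assuming the server builds the markup in Python and passes the user's HTML through the iframe's `srcdoc` attribute. Neither is stated in this answer, and the function name `sandboxed_iframe` is hypothetical.

```python
from html import escape


def sandboxed_iframe(user_html):
    """Wrap untrusted HTML in an iframe with every sandbox restriction on."""
    # An empty sandbox attribute applies all restrictions, so scripts
    # inside the frame do not run. Escaping keeps the untrusted markup
    # inside the srcdoc attribute value.
    return f'<iframe sandbox srcdoc="{escape(user_html, quote=True)}"></iframe>'
```

The restrictions only hold while the `sandbox` value stays restrictive. Adding both `allow-scripts` and `allow-same-origin` would let the framed content remove the sandbox itself.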

I hope this helps!

2 Comments

That doesn't take care of inline script.
I gave a solution to use sandbox iframes... it is almost impossible to detect inline script
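
The first comment's objection can be checked directly. The sketch below runs the answer's regex through Python's `re` module, which is an assumption because the thread does not name a regex engine. The `<img onerror>` and uppercase inputs are extra illustrations that are not in the question.

```python
import re

# The pattern from the answer above, unchanged.
SCRIPT_TAG_RE = re.compile(
    r"""<script([^'"]|"(\\.|[^"\\])*"|'(\\.|[^'\\])*')*?<\/script>"""
)


def looks_dangerous(html_text):
    """Return True when the answer's regex finds a <script> element."""
    return SCRIPT_TAG_RE.search(html_text) is not None


# The question's first two examples are detected.
print(looks_dangerous("<script>alert('1')</script>"))
print(looks_dangerous('<script src="..."></script>'))

# The third example and the two added inputs pass through undetected.
print(looks_dangerous('<a href="javascript:dangerousFunction();">click me</a>'))
print(looks_dangerous('<img src="x" onerror="alert(1)">'))
print(looks_dangerous("<SCRIPT>alert(1)</SCRIPT>"))
```

As written, the pattern is case-sensitive and only looks for `<script>` elements, so links, event-handler attributes and uppercase tags all get past it.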
