10

What's the best library or approach for removing JavaScript from HTML before it is displayed?

For example, take:

<html><body><span onmousemove='doBadXss()'>test</span></body></html>

and leave:

<html><body><span>test</span></body></html>

I've seen the DeXSS project, but is that the best way to go?

4
  • Probably the easiest way to do it is to use XSLT (write a stylesheet that copies only the allowable elements and attributes), but that only works if your document is XHTML (unless XSLT has an HTML input mode; I can't remember whether it does). Commented Nov 11, 2010 at 16:38
  • That you wrote "IE" instead of "i.e." confused me to no end! Commented Nov 11, 2010 at 16:45
  • @JasonFruit: lol, I got confused too. Commented Nov 11, 2010 at 16:47
  • Possible duplicate of How to "Purify" HTML code to prevent XSS attacks in Java or JSP? Commented Nov 11, 2010 at 17:01

3 Answers

11

JSoup has a simple method for sanitizing HTML based on a whitelist. Check http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

It uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:

There are still a number of known XSS attacks that DeXSS does not yet detect.

A blacklist disallows only known unsafe constructions, while a whitelist allows only known safe constructions. So only a whitelist protects against unknown, possibly unsafe constructions.
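
To make the whitelist idea concrete, here is a minimal sketch. It does not use JSoup's API; it assumes the input is already well-formed XHTML and relies only on the JDK's built-in DOM parser (Java 9+ for `Map.of`/`Set.of`). The class name `WhitelistSketch`, the method `clean`, and the contents of `ALLOWED` are hypothetical choices for illustration, not a vetted policy.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WhitelistSketch {
    private static final Set<String> NO_ATTRS = Set.of();

    // Hypothetical whitelist (an assumption): allowed tag -> allowed attributes.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
            "html", NO_ATTRS, "body", NO_ATTRS, "p", NO_ATTRS,
            "span", NO_ATTRS, "b", NO_ATTRS, "i", NO_ATTRS,
            "a", Set.of("title")); // href is left out: it would need URL scheme checks

    static String clean(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot define entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            Element root = doc.getDocumentElement();
            Set<String> rootAttrs = ALLOWED.get(root.getTagName().toLowerCase(Locale.ROOT));
            if (rootAttrs == null) {
                throw new IllegalArgumentException("root element is not on the whitelist");
            }
            stripAttributes(root, rootAttrs);
            filterChildren(root);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException
                | TransformerException e) {
            throw new IllegalArgumentException("input could not be processed as XHTML", e);
        }
    }

    // Keep whitelisted child elements and plain text; drop everything else.
    private static void filterChildren(Element parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // read before a possible removal
            if (child instanceof Element) {
                Element el = (Element) child;
                Set<String> attrs = ALLOWED.get(el.getTagName().toLowerCase(Locale.ROOT));
                if (attrs == null) {
                    parent.removeChild(el); // unknown element: drop it and its subtree
                } else {
                    stripAttributes(el, attrs);
                    filterChildren(el);
                }
            } else if (child.getNodeType() != Node.TEXT_NODE) {
                parent.removeChild(child); // comments, processing instructions, etc.
            }
            child = next;
        }
    }

    // Remove every attribute that is not explicitly allowed for this element.
    private static void stripAttributes(Element el, Set<String> allowed) {
        NamedNodeMap attrs = el.getAttributes();
        // Walk backwards: the map is live and shrinks as attributes are removed.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (!allowed.contains(name.toLowerCase(Locale.ROOT))) {
                el.removeAttribute(name);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(clean(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>"));
    }
}
```

With the question's input, the `onmousemove` attribute is dropped because it is not on the list, not because it was recognized as dangerous. Real-world HTML is rarely well-formed XML, which is one reason to prefer a library with a lenient HTML parser such as JSoup.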

1

The easiest way would be to not accept such markup in the first place. It would probably make sense to allow only very simple tags in free-form fields and to disallow attributes entirely.

Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.


Similarly, an even easier approach would be to provide a text-based syntax, like Markdown, for editing. There are not many ways to exploit the SO edit area, for instance, because it accepts only Markdown syntax plus a limited list of tags without attributes.

1

You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). It is a DOM parser (as opposed to SAX) that lets you easily traverse and manipulate the DOM, removing attributes such as onmouseover (or entire elements such as <script>) before writing the document back out or streaming it somewhere. Depending on how wild your HTML is, you may need to clean it up first; jtidy (http://jtidy.sourceforge.net/) is good for that.

Obviously, all of this adds some overhead if you do it at page render time.
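
The traversal described in this answer can be sketched roughly as follows. Note that this uses the JDK's built-in `org.w3c.dom` API rather than dom4j, and it assumes the markup has already been tidied into well-formed XHTML (for example by jtidy). The class name `EventHandlerStripper` and the method `strip` are hypothetical. It is a blacklist-style pass that removes only `<script>` elements and `on*` attributes, so it will miss other vectors such as `javascript:` URLs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EventHandlerStripper {

    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot define entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // "*" matches every element. The list is live, so walk it backwards;
            // removing a node then never shifts the indices still to be visited.
            NodeList all = doc.getElementsByTagName("*");
            for (int i = all.getLength() - 1; i >= 0; i--) {
                Element el = (Element) all.item(i);
                if (el.getTagName().equalsIgnoreCase("script")) {
                    el.getParentNode().removeChild(el); // drop the whole element
                } else {
                    removeEventHandlers(el);
                }
            }

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException
                | TransformerException e) {
            throw new IllegalArgumentException("input could not be processed as XHTML", e);
        }
    }

    // Remove attributes whose name starts with "on" (onmouseover, onclick, ...).
    private static void removeEventHandlers(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (name.toLowerCase(Locale.ROOT).startsWith("on")) {
                el.removeAttribute(name);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<html><body><span onmousemove='doBadXss()'>test</span></body></html>"));
    }
}
```

Because the whole document is parsed, modified, and serialized again, caching the cleaned result (or cleaning once at submission time) would avoid repeating this work on every render.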
