I'm using CKEditor to let users enter rich text, including embedded images. That content is then sent to other users. How can I prevent malicious injection such as XSS? I think I just need to clean the HTML on the server side by removing all possible scripting, but I can't find a well-tested tool to do that. Even GWT's SafeHtmlUtils won't work, because it modifies the HTML too much and breaks the input the user intended.

Edit:

I've found a sanitizer called Jsoup. It does almost exactly what I need, but even in relaxed mode it removes img tags that contain embedded images.

2 Answers

2

I managed to clean my HTML input with Jsoup this way:

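import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;  // newer jsoup releases rename this class to Safelist

String cleanHTML = Jsoup.clean(dirtyHTML,
        Whitelist.relaxed()
                .addProtocols("img", "src", "data")  // allow data: URIs in <img src>
                .addAttributes(":all", "style")      // keep the style attribute on every tag
                .addTags("span"));                   // allow <span>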

This accepts any img whose src starts with "data:". That is OK for now, but I have asked a separate question to find a way to accept only the content CKEditor generates ("data:;base64").
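One way to narrow this down is a stricter check on each src value after cleaning. The following is only a sketch using the Java standard library: the class name `DataUriCheck`, the method name, and the list of allowed image types (png, jpeg, gif) are assumptions made for illustration, not something CKEditor or Jsoup defines.

```java
import java.util.regex.Pattern;

final class DataUriCheck {
    // Assumption: only these image types are wanted; adjust to what your editor emits.
    private static final Pattern BASE64_IMAGE = Pattern.compile(
            "^data:image/(png|jpeg|gif);base64,[A-Za-z0-9+/]+={0,2}$");

    private DataUriCheck() {}

    /** Returns true only for a base64-encoded data URI of an allowed image type. */
    public static boolean isBase64ImageDataUri(String src) {
        return src != null && BASE64_IMAGE.matcher(src.trim()).matches();
    }
}
```

After `Jsoup.clean`, the cleaned fragment could be parsed again and every img whose src fails this check removed, so that only base64 image payloads survive rather than every "data:" URI.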

To display the sanitized HTML to the receiving user, we use a sandboxed iframe to avoid CSS disasters (such as a fixed-position image covering the whole page).

<iframe sandbox="allow-same-origin">Sanitized HTML here inside body tag</iframe>

1

It is very hard to separate good HTML from bad HTML automatically. I would not trust any tool, even if it claims to be secure. Such a separation is not limited to checking which tags or attributes are used and blocking some of them, such as the script tag or event handler attributes (like img.onerror). Many techniques exploit the way browsers parse and handle HTML, and new exploit methods keep appearing.

I believe the safest way is to use a Markdown editor, like the one used here on Stack Overflow.

You can find some references here: JQuery/JS Markdown plugin?

5 Comments

Thanks for the info. I've been reading about PageDown, which is used here. But "It should be noted that Markdown is not safe as far as user-entered input goes. Pretty much anything is valid in Markdown, in particular something like <script>doEvil();</script>. This PageDown repository includes the two plugins that Stack Exchange uses to sanitize the user's input; see the description of Markdown.Sanitizer.js below". I think we have no other option than to trust some sanitizer tool.
I think it would be easier to use Markdown plus a sanitizer that removes HTML completely. In addition to removing (or trying to remove) HTML from the user input, this sanitizer can HTML-encode the input and then apply the Markdown rules to add some HTML. That way, even if some HTML slips past the removal phase, it is guaranteed to be encoded in the output.
I can't completely remove HTML in my case. The whole point of the functionality is to let users send HTML-ready articles to other users. I think I will be OK with something like Jsoup that cleans out just the scripts, but I would like to keep embedded images.
I am not insisting, please do not get me wrong. I just want to be clear. When Markdown is used, no HTML tags appear in the user input. This is the way Markdown works. It has conventions: for example, a word that appears between two stars (*) should be rendered as bold. So normally, user-supplied Markdown does not include any HTML. The sanitizer can remove all HTML at this point, then HTML-encode the input string, then convert the Markdown conventions to real HTML tags (like converting * to <b>).
According to this: michelf.ca/blog/2010/markdown-and-xss the problem is still sanitizing HTML
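
The encode-then-convert flow described in the comments above can be sketched roughly as follows. This is a toy illustration, not PageDown's or any real Markdown library's implementation: the class `EncodeThenMarkdown`, its method names, and the single `*text*` to `<b>` rule are assumptions taken from the example in the comment.

```java
import java.util.regex.Pattern;

final class EncodeThenMarkdown {
    // Toy rule from the comment above: *text* becomes <b>text</b>.
    private static final Pattern STARRED = Pattern.compile("\\*([^*\\n]+)\\*");

    private EncodeThenMarkdown() {}

    /** HTML-encodes the characters that can start markup or close an attribute value. */
    public static String htmlEncode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Encodes first, then applies the Markdown rule, so only generated tags are live HTML. */
    public static String render(String userInput) {
        String encoded = htmlEncode(userInput);
        return STARRED.matcher(encoded).replaceAll("<b>$1</b>");
    }
}
```

With this sketch, `render("*hi* <script>x</script>")` yields `<b>hi</b> &lt;script&gt;x&lt;/script&gt;`. A real converter also has rules for links and images, where the URL itself still has to be checked, which is the remaining problem the linked article points to.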
