I'm trying to code a secure and lightweight white-list based HTML purifier which will use DOMDocument. In order to avoid unnecessary complexity I am willing to make the following compromises:
- HTML comments are removed
- `script` and `style` tags are stripped altogether
- only the child nodes of the `body` tag will be returned
- all HTML attributes that can trigger Javascript events will either be validated or removed
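To make the first three rules concrete, here is a minimal sketch of how they might look with DOMDocument. The function name `purifySketch` and the wrapping markup are my own assumptions for illustration, and the attribute rule is deliberately not covered here:

```php
<?php
// Hypothetical helper: the name and structure are illustrative assumptions.
function purifySketch(string $html): string
{
    $dom = new DOMDocument();

    // Wrap the fragment so there is always a <body>; the meta tag hints UTF-8 to libxml.
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
             . '</head><body>' . $html . '</body></html>';

    // Collect parse warnings about malformed markup internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select comments plus script and style elements, then remove them.
    $xpath = new DOMXPath($dom);
    $unwanted = iterator_to_array($xpath->query('//comment() | //script | //style'), false);
    foreach ($unwanted as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Return only the serialized child nodes of <body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $output = '';
    foreach ($body->childNodes as $child) {
        $output .= $dom->saveHTML($child);
    }
    return $output;
}
```

Calling `purifySketch()` on a fragment that contains a comment and a `script` element should give back the surrounding markup without either of them.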
I've been reading a lot about XSS attacks and prevention, and I hope I'm not being too naive (if I am, please let me know!) in assuming that if I follow all the rules mentioned above, I will be safe from XSS.
The problem is I am not sure what other tags and attributes (in any [X]HTML version and/or browser versions/implementations) can trigger Javascript events, besides the default Javascript event attributes:
- onAbort
- onBlur
- onChange
- onClick
- onDblClick
- onDragDrop
- onError
- onFocus
- onKeyDown
- onKeyPress
- onKeyUp
- onLoad
- onMouseDown
- onMouseMove
- onMouseOut
- onMouseOver
- onMouseUp
- onMove
- onReset
- onResize
- onSelect
- onSubmit
- onUnload
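One way to avoid depending on a complete list might be to drop every attribute whose name starts with "on", whether or not it is a known event. The sketch below assumes that prefix rule and a hypothetical function name, `stripEventAttributes`. It is not a guarantee that no other attribute can run code:

```php
<?php
// Hypothetical helper: the "on" prefix rule is an assumption, not an exhaustive defence.
function stripEventAttributes(DOMElement $element): void
{
    // Copy the attributes into an array first, because removing entries
    // from the live DOMNamedNodeMap while iterating over it can skip some.
    foreach (iterator_to_array($element->attributes, false) as $attribute) {
        // Case-insensitive prefix match, so onClick and onclick are treated alike.
        if (stripos($attribute->nodeName, 'on') === 0) {
            $element->removeAttributeNode($attribute);
        }
    }
}
```

This would be called once for every element in the document, for example while walking the result of an XPath query such as `//*`.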
Are there any other non-default or proprietary event attributes that can trigger Javascript (or VBScript, etc...) events or code execution? I can think of href, style and action, for instance:
<a href="javascript:alert(document.location);">XSS</a> // or
<b style="width: expression(alert(document.location));">XSS</b> // or
<form action="javascript:alert(document.location);"><input type="submit" /></form>
I will probably just remove any style attributes from the HTML tags. The action and href attributes pose a bigger challenge, but I think the following code is enough to make sure their value is either a relative or absolute URL and not some nasty Javascript code:
$value = $attribute->value;

// A colon suggests a URL scheme: keep the attribute only for (s)ftp(s), http(s) or mailto.
if ((strpos($value, ':') !== false) && (preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) == 0))
{
    $node->removeAttributeNode($attribute);
}
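To see how this check behaves on different values, the same condition can be wrapped in a small function and fed a few inputs. The function name `hasAllowedScheme` is a hypothetical one introduced only for this sketch:

```php
<?php
// Hypothetical wrapper around the check above; the name is an assumption.
function hasAllowedScheme(string $value): bool
{
    // No colon at all: the value is treated as a relative URL and kept.
    if (strpos($value, ':') === false) {
        return true;
    }

    // Otherwise the value must start with one of the accepted schemes.
    return preg_match('~^(?:(?:s?f|ht)tps?|mailto):~i', $value) === 1;
}
```

With this logic, `http://example.com/` and `/relative/path` are kept, while `javascript:alert(document.location);` is rejected. Two side effects follow from the rule as written: a relative URL that happens to contain a colon, such as `page.php?time=12:30`, is also rejected, and a protocol-relative URL such as `//example.com` is kept because it contains no colon.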
So, my two obvious questions are:
- Am I missing any tags or attributes that can trigger events?
- Is there any attack vector that is not covered by these rules?
After a lot of testing, pondering and researching, I've come up with the following (rather simple) implementation, which appears to be immune to any XSS attack vector I could throw at it.
I highly appreciate all your valuable answers, thanks.
http:jascript:alert(.... `script` and `style` will always be removed, however). Tag attributes can be white-listed or not (allow all attributes, which should be internally sanitized or black-listed). If you allow the `a` tag, you probably also need to allow the `href` attribute and you still have the same problem. That's why I thought of a second-pass black-list approach, since white-listing all possible tag attribute values would be way too cumbersome and highly susceptible to human error.