65

What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary characters combinations to prevent an XSS or SQL injection attack?

3
  • 10
    You should not be attempting to fix SQL injection by sanitising user input! If the database API is used properly there is no chance of SQL injection. Commented Mar 22, 2010 at 20:52
  • 4
    ... if database API is used properly there is no chance of SQL injection. By properly, do you mean use parameterized queries? Does that cover you 100%? Commented Aug 27, 2014 at 15:03
  • 2
    @buffer, I know your comment is old, but if you want other people besides OP to see your comments you have to call them out with an \@ symbol. Commented Oct 27, 2015 at 19:59

7 Answers

29

Here is a snippet that will remove all tags not on the whitelist, and all tag attributes not on the attribute whitelist (so you can't use onclick).

It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja&#x09;vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)

As you can see, it uses the (awesome) BeautifulSoup library. Note that the imports below (urlparse, BeautifulSoup) target Python 2 and BeautifulSoup 3.

import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val) # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))

    return soup.renderContents().decode('utf8')

As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.


7 Comments

I upvoted this, but now I'm not so sure. I don't think this protects IE users from src="vbscript:msgbox('xss')" attacks.
You could easily add that with another regex for vbscript: like the one for javascript:
@tghw, The vbscript example here is why whitelist solutions are generally preferable to blacklist solutions. How can you know for sure that everything you need is blacklisted? With a blacklist, a new browser could come out next week and be vulnerable because it supports a new type of script tag.
@gnibbler I agree, and most of that is a whitelist solution, but for href and src, there's really no way to whitelist easily. The only option I can think of would be to make all URLs absolute by passing in the page URL and then going through each link and image and figure out the absolute URL based on the page URL. The more I think of it, the easier this seems. I'll add it above.
It's generally very hard to sanitize HTML, there are plenty of vectors: nick.cleaton.net/xssrant.html
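The whitelist idea raised in these comments can also be applied to href and src values by allowing only known URL schemes instead of stripping known-bad ones. The following is a rough sketch under stated assumptions, not part of the original answer: is_safe_url and ALLOWED_SCHEMES are hypothetical names, the input is assumed to be the attribute value after the HTML parser has decoded entities, and the allowed scheme list is only an example.

```python
import re

# Hypothetical allowlist; adjust to the schemes your site really needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# Browsers ignore tabs, newlines and other control characters inside a
# scheme, so strip them before looking for one.
_CONTROL_OR_SPACE = re.compile(r"[\x00-\x20\x7f]")


def is_safe_url(url):
    """Return True for relative URLs and for URLs with an allowed scheme."""
    cleaned = _CONTROL_OR_SPACE.sub("", url)
    # A scheme can only appear before the first '/', '?' or '#'.
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    if ":" not in head:
        return True  # no scheme at all: treat as a relative URL
    scheme = head.split(":", 1)[0].lower()
    return scheme in ALLOWED_SCHEMES
```

With this approach a scheme such as vbscript: or data: is rejected without ever being listed, which is the advantage of a whitelist over a blacklist.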
23

Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.

html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.

Here's how I'm using it in my Stack Overflow clone's sanitize_html utility function:

http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py

I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format) at it after performing Markdown to HTML conversion using python-markdown2, and it seems to have held up ok.

The WMD editor component which Stack Overflow currently uses is a problem, though - I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.

1 Comment

2023 Update: bleach is deprecated. The new recommendation seems to be nh3.
13

The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any HTML input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z, A-Z, 0-9 for example) is going to let something through.
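A minimal sketch of that encoding step, using the standard library's html.escape. The render_comment function and the surrounding <p> template are hypothetical names chosen for illustration. This covers text placed in an HTML body or a quoted attribute; other output contexts (script blocks, URLs, CSS) need their own encoding.

```python
import html


def render_comment(user_text):
    """Embed untrusted text in an HTML paragraph as plain text."""
    # quote=True also encodes " and ', which matters inside attributes.
    return "<p>{}</p>".format(html.escape(user_text, quote=True))


print(render_comment("<script>alert('x')</script>"))
```

The markup characters arrive in the browser as entities, so the payload is displayed rather than executed.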

SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.
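To make the difference concrete, here is a small sketch using the standard library's sqlite3 module and an in-memory database. The users table and the find_user_unsafe / find_user helpers are made up for this illustration; other DB-API drivers use the same idea but may use a different placeholder style (for example %s instead of ?).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])


def find_user_unsafe(conn, name):
    # Vulnerable: the input becomes part of the SQL text itself.
    query = "SELECT name FROM users WHERE name = '%s'" % name
    return conn.execute(query).fetchall()


def find_user(conn, name):
    # Parameterized: the driver passes the value separately from the SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


malicious = "nobody' OR '1'='1"
print(len(find_user_unsafe(conn, malicious)))  # every row matches
print(find_user(conn, malicious))              # no row has that literal name
```

The concatenated query returns every row because the quote in the input closes the string literal, while the parameterized query treats the whole input as a single value.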

This is not to say that filtering isn't still a best practice, but in terms of SQL injection and XSS, you will be far more protected if you religiously use parameterized queries and HTML entity encoding.

1 Comment

This is not correct in many cases. See OWASP notes on "Why Can't I Just HTML Entity Encode Untrusted Data?" owasp.org/index.php/…
6

Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/

However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.

SQL injection also shouldn't be a concern, provided you pass values as query parameters rather than formatting them into the SQL string yourself. All of Python's database libraries (MySQLdb, cx_Oracle, etc) escape the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitization there either.


4

I don't do web development much any longer, but when I did, I did something like so:

When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with HTML when I display it (cgi.escape() in Python; on Python 3 use html.escape(), since cgi.escape() was removed in Python 3.8).

Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)

In short: always escape whatever can affect the current target for the data.

When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
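As a rough illustration of the "non-intersecting markup" idea, the sketch below escapes the stored text first and only then turns a tiny made-up markup (*text* for bold) into HTML. The render_markup function and the *bold* syntax are assumptions for this example, not something from the original answer; html.escape is the Python 3 counterpart of cgi.escape.

```python
import html
import re

# Hypothetical markup: *text* becomes bold. The asterisk is not an HTML
# special character, so escaping leaves the markup intact.
_BOLD = re.compile(r"\*([^*\n]+)\*")


def render_markup(text):
    """Escape untrusted text, then expand the lightweight markup."""
    escaped = html.escape(text)
    return _BOLD.sub(r"<b>\1</b>", escaped)


print(render_markup("*hi* <script>"))  # <b>hi</b> &lt;script&gt;
```

Because escaping happens before the markup is expanded, the only tags in the output are the ones the renderer itself generates.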

See also Escaping HTML


0

If you are using a framework like Django, the framework can easily do this for you using standard filters. In fact, I'm pretty sure Django automatically does it unless you tell it not to.

Otherwise, I would recommend using some sort of regex validation before accepting inputs from forms. I don't think there's a silver bullet for your problem, but using the re module, you should be able to construct what you need.
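For instance, a form field with a well-defined shape can be checked against a pattern before it is accepted. This is only a sketch: the username rule (3 to 20 letters, digits or underscores) and the is_valid_username name are assumptions for the example. Validation like this complements output escaping and parameterized queries; it does not replace them.

```python
import re

# Hypothetical rule: 3-20 characters, letters, digits or underscore only.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,20}")


def is_valid_username(value):
    """Accept the value only if the whole string matches the pattern."""
    # fullmatch avoids the pitfall of '$' also matching before a
    # trailing newline.
    return USERNAME_RE.fullmatch(value) is not None
```

Using re.fullmatch (rather than re.match with a trailing $) ensures that nothing outside the allowed character set can ride along at the end of the input.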

1 Comment

I don't think Django's builtins can prevent XSS attacks, according to the doc here: docs.djangoproject.com/en/3.2/ref/templates/builtins/#striptags. I suggest bleach
0

To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.

For example (if it is acceptable to remove quotes completely):

datasetName = datasetName.replace("'","").replace('"',"")

1 Comment

Er... no... I would still not do this. For everything that is a data item, use parameterized queries. For non-data (dynamically built queries), you should really definitely be using a whitelist. pg_catalog.pg_user contains no quotes, but you probably don't want that in your generated queries, either. Instead do something like datasetName = datasetName if datasetName in DATASETNAME_WHITELIST else sulk()
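A sketch of what this comment suggests, using sqlite3 for the sake of a runnable example. DATASETNAME_WHITELIST comes from the comment; the fetch_rows helper, the table names and the in-memory setup are hypothetical. The table name is checked against a fixed whitelist because identifiers cannot be passed as query parameters, while the data value still goes through a placeholder.

```python
import sqlite3

# Only these identifiers may ever appear in a generated query.
DATASETNAME_WHITELIST = {"sales", "inventory"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?)", [(1,), (2,), (3,)])


def fetch_rows(conn, dataset_name, min_id):
    """Build the query from a whitelisted table name and a bound value."""
    if dataset_name not in DATASETNAME_WHITELIST:
        raise ValueError("unknown dataset: %r" % (dataset_name,))
    # Safe to interpolate: dataset_name is one of our own constants.
    query = 'SELECT id FROM "{}" WHERE id >= ? ORDER BY id'.format(dataset_name)
    return conn.execute(query, (min_id,)).fetchall()


print(fetch_rows(conn, "sales", 2))  # [(2,), (3,)]
```

A name such as pg_catalog.pg_user is rejected even though it contains no quotes, which is exactly the case that quote stripping misses.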
