Sanitising user input using Python

Question

What is the best way to sanitize user input for a Python-based web application? Is there a single function to remove HTML characters and any other necessary character combinations to prevent an XSS or SQL injection attack?

You should not be attempting to fix SQL injection by sanitising user input! If the database API is used properly there is no chance of SQL injection.
– John La Rooy, Commented Mar 22, 2010 at 20:52
... if database API is used properly there is no chance of SQL injection. By properly, do you mean use parameterized queries? Does that cover you 100%?
– user, Commented Aug 27, 2014 at 15:03
@buffer, I know your comment is old, but if you want other people besides OP to see your comments you have to call them out with an @ symbol.
– user1717828, Commented Oct 27, 2015 at 19:59

tghw · Accepted Answer · 2010-03-25 01:26:22Z

29

Here is a snippet that will remove all tags not on the whitelist, and all tag attributes not on the attributes whitelist (so you can't use onclick).

It is a modified version of http://www.djangosnippets.org/snippets/205/, with the regex on the attribute values to prevent people from using href="javascript:...", and other cases described at http://ha.ckers.org/xss.html.
(e.g. <a href="ja	vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)

As you can see, it uses the (awesome) BeautifulSoup library.

import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                val = re_scripts.sub('', val) # Remove scripts (vbs & js)
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))

    return soup.renderContents().decode('utf8')
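
To see what the attribute-value pattern in the snippet does in isolation, here is a small sketch that uses only the standard library's re module. The pattern is built the same way as in sanitizeHtml; the helper name strip_script_schemes and the sample strings are made up for illustration.

```python
import re

# Same construction as in sanitizeHtml: allow optional whitespace or a hex
# character reference between every character of the scheme name.
SEP = r'[\s]*(&#x.{1,7})?'
rjs = SEP.join(list('javascript:'))
rvb = SEP.join(list('vbscript:'))
re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)


def strip_script_schemes(value):
    """Remove javascript: / vbscript: scheme prefixes from an attribute value."""
    return re_scripts.sub('', value)


# A tab hidden inside the scheme name is still recognised and stripped.
print(strip_script_schemes("ja\tvascript:alert('hi')"))
# An ordinary URL passes through unchanged.
print(strip_script_schemes("http://example.com/"))
```

This part of the sanitiser is a blacklist on attribute values, so it only covers the schemes it names, which is the limitation raised in the comments on this answer.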

As the other posters have said, pretty much all Python db libraries take care of SQL injection, so this should pretty much cover you.

edited Mar 25, 2010 at 1:26

answered Aug 24, 2008 at 16:08

tghw


7 Comments

Gareth Simpson Over a year ago

I upvoted this, but now I'm not so sure. I don't think this protects IE users from src="vbscript:msgbox('xss')" attacks.

tghw Over a year ago

You could easily add that with another regex for vbscript: like the one for javascript:

John La Rooy Over a year ago

@tghw, The vbscript example here is why whitelist solutions are generally preferable to blacklist solutions. How can you know for sure that everything you need is blacklisted? With a blacklist, a new browser could come out next week and be vulnerable because it supports a new type of script tag.

tghw Over a year ago

@gnibbler I agree, and most of that is a whitelist solution, but for href and src, there's really no way to whitelist easily. The only option I can think of would be to make all URLs absolute by passing in the page URL and then going through each link and image and figure out the absolute URL based on the page URL. The more I think of it, the easier this seems. I'll add it above.

rjh Over a year ago

It's generally very hard to sanitize HTML, there are plenty of vectors: nick.cleaton.net/xssrant.html


Jonny Buchanan · Accepted Answer · 2011-06-07 19:08:45Z

23

Edit: bleach is a wrapper around html5lib which makes it even easier to use as a whitelist-based sanitiser.

html5lib comes with a whitelist-based HTML sanitiser - it's easy to subclass it to restrict the tags and attributes users are allowed to use on your site, and it even attempts to sanitise CSS if you're allowing use of the style attribute.

Here's how I'm using it in my Stack Overflow clone's sanitize_html utility function:

http://code.google.com/p/soclone/source/browse/trunk/soclone/utils/html.py

I've thrown all the attacks listed in ha.ckers.org's XSS Cheatsheet (which are handily available in XML format) at it after performing Markdown to HTML conversion using python-markdown2, and it seems to have held up OK.

The WMD editor component which Stack Overflow currently uses is a problem, though. I actually had to disable JavaScript in order to test the XSS Cheatsheet attacks, as pasting them all into WMD ended up giving me alert boxes and blanking out the page.

edited Jun 7, 2011 at 19:08

answered Oct 30, 2008 at 0:41

Jonny Buchanan


1 Comment

Che Over a year ago

2023 Update: bleach is deprecated. The new recommendation seems to be nh3.

user17898 · Accepted Answer · 2008-09-18 15:56:09Z

13

The best way to prevent XSS is not to try and filter everything, but rather to simply do HTML entity encoding. For example, automatically turn < into &lt;. This is the ideal solution assuming you don't need to accept any HTML input (outside of forum/comment areas where it is used as markup, it should be pretty rare to need to accept HTML); there are so many permutations via alternate encodings that anything but an ultra-restrictive whitelist (a-z, A-Z, 0-9 for example) is going to let something through.
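
As a minimal sketch of entity encoding in Python 3, the standard library's html.escape turns &, < and > (and, by default, both quote characters) into entities. The function name render_comment and the surrounding <p> wrapper are assumptions made for this example only.

```python
import html


def render_comment(user_text):
    # Escape the user's text before it is placed into the HTML element body,
    # so any markup they typed is displayed as literal characters.
    return "<p>%s</p>" % html.escape(user_text)


print(render_comment("<script>alert('hi')</script>"))
```

This covers text placed in an element body. Other output contexts (for example inside a script block or a URL) need their own encoding, which is likely what the comment below this answer is getting at.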

SQL Injection, contrary to other opinion, is still possible, if you are just building out a query string. For example, if you are just concatenating an incoming parameter onto a query string, you will have SQL Injection. The best way to protect against this is also not filtering, but rather to religiously use parameterized queries and NEVER concatenate user input.

This is not to say that filtering isn't still a best practice, but in terms of SQL injection and XSS, you will be far more protected if you religiously use parameterized queries and HTML entity encoding.
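
A short sketch of a parameterized query, using the standard library's sqlite3 module with an in-memory database. The users table and the find_user helper are invented for the example. Placeholder syntax depends on the driver: sqlite3 uses ?, while MySQLdb uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(conn, name):
    # The ? placeholder passes `name` to the driver as data; it is never
    # spliced into the SQL text, so quotes in it cannot change the query.
    cur = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


print(find_user(conn, "alice"))
# A classic injection string is treated as an (unmatched) name.
print(find_user(conn, "' OR '1'='1"))
```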

answered Sep 18, 2008 at 15:56

user17898


1 Comment

Purrell Over a year ago

This is not correct in many cases. See OWASP notes on "Why Can't I Just HTML Entity Encode Untrusted Data?" owasp.org/index.php/…

Community · Accepted Answer · 2021-01-18 12:38:11Z

6

Jeff Atwood himself described how StackOverflow.com sanitizes user input (in non-language-specific terms) on the Stack Overflow blog: https://blog.stackoverflow.com/2008/06/safe-html-and-xss/

However, as Justin points out, if you use Django templates or something similar then they probably sanitize your HTML output anyway.

SQL injection also shouldn't be a concern. All of Python's database libraries (MySQLdb, cx_Oracle, etc) always sanitize the parameters you pass. These libraries are used by all of Python's object-relational mappers (such as Django models), so you don't need to worry about sanitization there either.

edited Jan 18, 2021 at 12:38


answered Aug 19, 2008 at 20:51

Eli Courtwright


Henrik Gustafsson · Accepted Answer · 2008-08-24 16:23:13Z

I don't do web development much any longer, but when I did, I did something like so:

When no parsing is supposed to happen, I usually just escape the data to not interfere with the database when I store it, and escape everything I read up from the database to not interfere with html when I display it (cgi.escape() in python).

Chances are, if someone tried to input html characters or stuff, they actually wanted that to be displayed as text anyway. If they didn't, well tough :)

In short, always escape whatever can affect the current target for the data.

When I did need some parsing (markup or whatever) I usually tried to keep that language in a non-intersecting set with html so I could still just store it suitably escaped (after validating for syntax errors) and parse it to html when displaying without having to worry about the data the user put in there interfering with your html.
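
A rough sketch of that last idea: escape first, then translate a tiny markup language that shares no characters with HTML. The *bold* syntax and the render function are hypothetical, chosen only to illustrate the approach. Note that cgi.escape was removed in Python 3.8; html.escape is its replacement.

```python
import html
import re

# Hypothetical mini-markup: *text* means bold. The asterisk is not touched
# by HTML escaping, so the markup survives the escape step intact.
BOLD = re.compile(r'\*(.+?)\*')


def render(stored_text):
    # Neutralise any HTML the user typed, then expand our own markup.
    escaped = html.escape(stored_text)
    return BOLD.sub(r'<strong>\1</strong>', escaped)


print(render("*hello* <script>"))
```

Because the only tags in the output are the ones the renderer itself emits, user-supplied angle brackets can never become markup.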

1 Comment

AthulMuralidhar Over a year ago

I don't think Django's builtins can prevent XSS attacks, according to the doc here: docs.djangoproject.com/en/3.2/ref/templates/builtins/#striptags. I suggest bleach

Mr. Napik · Accepted Answer · 2009-10-01 12:21:45Z

0

To sanitize a string input which you want to store to the database (for example a customer name) you need either to escape it or plainly remove any quotes (', ") from it. This effectively prevents classical SQL injection which can happen if you are assembling an SQL query from strings passed by the user.

For example (if it is acceptable to remove quotes completely):

datasetName = datasetName.replace("'","").replace('"',"")

answered Oct 1, 2009 at 12:21

Mr. Napik


1 Comment

SingleNegationElimination Over a year ago

Er... no... I would still not do this. For everything that is a data item, use parameterized queries. For non-data (dynamically built queries), you should really definitely be using a whitelist. pg_catalog.pg_user contains no quotes, but you probably don't want that in your generated queries, either. Instead do something like datasetName = datasetName if datasetName in DATASETNAME_WHITELIST else sulk()
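
The whitelist idea in this comment could look roughly like the sketch below, again using sqlite3 with an in-memory database. The table names, the contents of DATASETNAME_WHITELIST and the count_rows helper are all assumptions for illustration; raising ValueError stands in for sulk().

```python
import sqlite3

# Hypothetical set of table names the application is willing to query.
DATASETNAME_WHITELIST = {"sales", "inventory"}


def count_rows(conn, dataset_name):
    # Identifiers cannot be bound as query parameters, so check the name
    # against a fixed list of our own literals before building the SQL.
    if dataset_name not in DATASETNAME_WHITELIST:
        raise ValueError("unknown dataset: %r" % (dataset_name,))
    return conn.execute('SELECT COUNT(*) FROM "%s"' % dataset_name).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER)")
conn.execute("INSERT INTO sales VALUES (10)")
print(count_rows(conn, "sales"))
```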
