
I have an application in which many companies post information. The data from each company is self-contained; there is no data overlap between companies.

Performance-wise, is it better to:

  • keep the company ID on each row of each table and include it in every index,
  • partition each table by company ID,
  • partition and also create a separate database user per company to enforce security, or
  • create multiple databases, one for each company?

This is a web-based application with persistent connections.

My thoughts:

  • new PostgreSQL connections are expensive, so a single database means fewer new connections
  • having only one copy of the data dictionary seems more efficient than 200 or so
  • multiple databases are certainly safer against programmer error
  • if the application specs change so that companies share data, multiple databases would be difficult to work with

1 Answer


I'd recommend searching for info on the PostgreSQL mailing lists about multi-tenanted design. There's been lots of discussion there, and the answer boils down to "it depends". There are trade-offs every way between guaranteed isolation, performance, and maintainability.

A common approach is to use a single database, but one schema (namespace) per customer with the same table structure in each schema, plus a shared or common schema for data that's the same across all of them. A PostgreSQL schema is like a MySQL "database" in that you can query across different schemas, but they're isolated by default. With customer data in separate schemas you can use the search_path setting, usually set per user via ALTER USER customername SET search_path = customerschema, sharedschema, to ensure each customer sees their data and only their data.
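
For illustration, a minimal sketch of that layout might look like the following (the schema, role, and table names are placeholders, not anything prescribed by the answer):

    -- One schema per customer, plus a shared schema for common data
    CREATE SCHEMA shared;
    CREATE SCHEMA customer_acme;

    -- A login role for the customer (placeholder name and password)
    CREATE ROLE customer_acme LOGIN PASSWORD 'changeme';

    -- Each customer sees their own schema first, then the shared one
    ALTER ROLE customer_acme SET search_path = customer_acme, shared;

    -- The same non-shared table structure is created in every customer schema
    CREATE TABLE customer_acme.invoices (
        id      serial PRIMARY KEY,
        amount  numeric NOT NULL,
        created timestamptz NOT NULL DEFAULT now()
    );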

For additional protection, you should REVOKE ALL ON SCHEMA customerschema FROM public then GRANT ALL ON SCHEMA customerschema TO thecustomer so they're the only one with any access to it, doing the same to each of their tables. Your connection pool can then log in with a fixed user account that has no GRANTed access to any customer schema but has the right to SET ROLE to become any customer. (Do that by giving it membership of each customer role with NOINHERIT set, so rights have to be explicitly claimed via SET ROLE.) The connection should immediately SET ROLE to the customer it's currently operating as. That'll allow you to avoid the overhead of making new connections for each customer while maintaining strong protection against programmer error leading to access to the wrong customer's data. So long as the pool does a DISCARD ALL and/or a RESET ROLE before handing connections out to the next client, that's going to give you very strong isolation without the frustration of individual connections per-user.
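
Sketching that out, reusing the placeholder names from above plus a pool role I'll call app_pool (also just an illustrative name):

    -- Lock the customer schema down to the customer's own role
    REVOKE ALL ON SCHEMA customer_acme FROM public;
    GRANT ALL ON SCHEMA customer_acme TO customer_acme;
    GRANT ALL ON ALL TABLES IN SCHEMA customer_acme TO customer_acme;

    -- The pool's fixed login role has no direct access, but is a member of
    -- each customer role; NOINHERIT means it must SET ROLE to use those rights
    CREATE ROLE app_pool LOGIN PASSWORD 'changeme' NOINHERIT;
    GRANT customer_acme TO app_pool;

    -- Per request, immediately become the current customer:
    SET ROLE customer_acme;
    -- ... run that customer's queries ...
    -- Before the connection is handed to the next client:
    DISCARD ALL;  -- and/or RESET ROLE, as described above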

If your web app environment doesn't have a decent connection pool built-in (say, you're using PHP with persistent connections) then you really need to put a good connection pool in place between Pg and the web server anyway, because too many connections to the backend will hurt your performance. PgBouncer and PgPool-II are the best options, and handily can take care of doing the DISCARD ALL and RESET ROLE for you during connection hand-off.
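
As a rough sketch of what that looks like with PgBouncer, the reset on hand-off is controlled by server_reset_query in pgbouncer.ini (the connection details and port below are illustrative):

    [databases]
    appdb = host=127.0.0.1 port=5432 dbname=appdb

    [pgbouncer]
    listen_port = 6432
    pool_mode = session
    ; run when a server connection is released back to the pool,
    ; so the next client starts with a clean session
    server_reset_query = DISCARD ALL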

The main downside of this approach is the overhead of maintaining that many tables, since your base set of non-shared tables is cloned for each customer. It'll add up as customer numbers grow, to the point where the sheer number of tables to examine during autovacuum runs starts to get expensive and where any operation that scales with the total number of tables in the DB slows down. This is more of an issue if you're thinking of having many thousands or tens of thousands of customers in the same DB, but I strongly recommend you do some scaling tests with this design using dummy data before committing to it.
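
One rough way to run that scaling test, with an arbitrary count and placeholder names, is to generate a few thousand dummy customer schemas with cloned tables and then see how autovacuum, pg_dump, and catalog queries behave:

    -- Generate 2000 dummy customer schemas purely to measure catalog overhead;
    -- adjust the count and table definitions to match your real design
    DO $$
    BEGIN
        FOR i IN 1..2000 LOOP
            EXECUTE format('CREATE SCHEMA dummy_customer_%s', i);
            EXECUTE format(
                'CREATE TABLE dummy_customer_%s.invoices (
                     id serial PRIMARY KEY,
                     amount numeric NOT NULL
                 )', i);
        END LOOP;
    END
    $$;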

The ideal approach is likely to be single tables with automatic row-level security controlling tuple visibility, but unfortunately that's something PostgreSQL doesn't have yet. It looks like it's on the way thanks to the SEPostgreSQL work adding suitable infrastructure and APIs, but it's not in 9.1.


5 Comments

Thanks much!! (Sorry, I've been working with MySQL lately and it made me brain dead.) Schemas should be an option over multiple databases - in fact, I've been using them for other projects. Setting the role after connecting is a great idea. I've been using search_path, but the combination of the two is best.
Yep, setting the role lets you use database-level security without so much pain. It's great.
... and remember to use a decent connection pool like PgPool-II or PgBouncer if you're using something primitive on the web server side like PHP with persistent connections. There's no need if you're using something like a Java app server that does its own connection pooling in-server.
Surprised that there would be a significant difference between pooled connection technologies for nothing-fancy type access.
@ccyoung Yep, that's why I mentioned it - most don't expect it. The reason is that Pg's core can't natively queue queries. There's a 1:1 mapping between connection = session = executor. Every connection has an executor engine, and every executor engine can be running at once. There's a sweet spot for the number of working executors (usually somewhere around num_cpus + num_hdds) and above that, adding more slows Pg down rather than speeding it up. So you use a connection pool to limit the load to that sweet spot and to queue work.
