
I am solving a performance issue on a system based on a PostgreSQL 9.6 database. Intro:

A 12-year-old system, similar to a banking system, whose most queried primary table is called transactions.

CREATE TABLE jrn.transactions (
     ID BIGSERIAL,
     type_id VARCHAR(200),
     account_id INT NOT NULL,
     date_issued DATE,
     date_accounted DATE,
     amount NUMERIC,
     ..
)

In the table transactions we store all transactions within a bank account. The field type_id determines the type of a transaction and also serves as the C# EntityFramework discriminator column. Values are like:

card_payment, cash_withdrawl, cash_in, ...

There are 14 known transaction types.

In general, there are 4 types of queries (nos. 3 and 4 are by far the most frequent):

  1. select single transaction like: SELECT * FROM jrn.transactions WHERE id = 3748734

  2. select single transaction with JOIN to other transaction like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734

  3. select 0-100, 100-200, .. transactions of given type like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100

  4. several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'

In the last few months we have seen unexpected row count growth; the table now holds 120M rows.

We are thinking of table partitioning, following the PostgreSQL docs: https://www.postgresql.org/docs/10/static/ddl-partitioning.html

Options:

  1. partition the table by type_id into 14 partitions
  2. add a year column and partition the table by year (or year_month) into 12 (or 144) partitions (a rough sketch follows below)
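
For reference, a minimal sketch of what option 2 could look like. Note that the declarative PARTITION BY syntax from the linked docs requires PostgreSQL 10; on 9.6 you would use inheritance plus CHECK constraints and an insert trigger, and existing rows would still have to be moved into the children. The partition names, the year column and the trigger function below are illustrative only:

ALTER TABLE jrn.transactions ADD COLUMN year INT;

-- One child table per year, each carrying a CHECK constraint the planner can use.
CREATE TABLE jrn.transactions_2017 (CHECK (year = 2017)) INHERITS (jrn.transactions);
CREATE TABLE jrn.transactions_2018 (CHECK (year = 2018)) INHERITS (jrn.transactions);

-- Route new rows into the matching child table.
CREATE OR REPLACE FUNCTION jrn.transactions_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.year = 2018 THEN
        INSERT INTO jrn.transactions_2018 VALUES (NEW.*);
    ELSIF NEW.year = 2017 THEN
        INSERT INTO jrn.transactions_2017 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'No partition for year %', NEW.year;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_insert
BEFORE INSERT ON jrn.transactions
FOR EACH ROW EXECUTE PROCEDURE jrn.transactions_insert_trigger();

-- constraint_exclusion (default 'partition') lets the planner skip children
-- whose CHECK constraint contradicts the query's WHERE clause.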

I am now restoring the data into our test environment; I am going to test both options.

What do you consider the most appropriate partitioning rule for such a situation? Any other options?

Thanks for any feedback / advice etc.

  • If queries 3 and 4 are the most frequent ones, we should try to optimize partitioning for them. Their WHERE conditions include three columns: account_id, type_id and date_issued. The selectivity of account_id is very high, so it is most likely better served by an index. date_issued is used with non-equality operators, so partitioning would not help much. type_id has low selectivity, so it could be OK, but it is not used in query 4. ==> there is no obvious solution (at least to me). Commented Feb 25, 2018 at 11:51
  • Note that you are referring to the docs of Postgresql 10, whilst your server is on 9.6. Commented Feb 25, 2018 at 11:53
  • On option 2: As query 4 does not contain year (or year_month), this does not help much either. Anyway, query 3 is not limited w.r.t. date_issued, so this would also not be an optimal solution. Commented Feb 25, 2018 at 11:57
  • I am currently thinking into the direction of a materialized view (e.g. for handling the aggregating query no. 4). @Luke1988: What is the update/insert frequency to jrn.transactions and how important is accuracy of sum(amount) in query 4? Commented Feb 25, 2018 at 11:59
  • Yes, because they are the most important foreign keys. Combined with one of the date fields, they almost construct the natural key for the table. Commented Feb 25, 2018 at 16:03

2 Answers


Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.

The only good reason I see for partitioning would be if you want to delete old rows efficiently; then partitioning by date would be best.
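
For illustration, removing a year's worth of old rows from a date-partitioned table would then be a cheap metadata operation instead of a long DELETE plus VACUUM; the partition name below is hypothetical:

-- Either drop the old child table outright ...
DROP TABLE jrn.transactions_2009;
-- ... or just detach it and keep the data around for archiving (inheritance-style partitioning on 9.6).
ALTER TABLE jrn.transactions_2009 NO INHERIT jrn.transactions;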

Based on your queries, you should have these indexes (apart from the primary key index):

CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);

The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):

CREATE INDEX ON jrn.transactions (account_id, type_id);
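
Whether the planner actually uses these indexes is easy to verify on your test restore with EXPLAIN; the literal values below are just the ones from your question:

-- Query no. 3: expect an index scan (or bitmap scan) on (account_id, type_id).
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM jrn.transactions
WHERE account_id = 43784 AND type_id = 'card_payment'
LIMIT 100;

-- Query no. 4: expect the (account_id, date_issued) index to be used.
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(amount), MIN(date_issued), MAX(date_issued)
FROM jrn.transactions
WHERE account_id = 3748734 AND date_issued >= '2017-01-01';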

2 Comments

Thank you! As far as I understand, partitioning won't be helpful due to account_id? If there were no account_id column (just theory now), would partitioning by type_id make sense? I could possibly redirect query no. 3 to that partition directly: SELECT * FROM jrn.transactions_cash_withdrawl LIMIT 100
Yes, if you partition by type_id, it would only scan that one partition. But if you have the appropriate index and use an index scan, that won't be much cheaper than on the big table. Partitioning only helps with plans that use a sequential scan and with mass deletion.

What you have here is almost a perfect case for column-based storage, such as you would get with an SAP HANA database. However, as you explicitly asked for a Postgres answer and I doubt that a HANA database is within the budget, we will have to stick with Postgres.

Your queries no. 3 and 4 pull in quite different directions, so there won't be "the single answer" to your problem; you will always have to balance between these two use cases. Still, I would try to use two different techniques to approach each of them individually.

From my perspective, the biggest problem is query no. 4, which puts quite a high load on your Postgres server just because it is summing up values. Moreover, you are summing up the same values over and over again, values which most likely won't change often (or even at all), as you have said that UPDATEs almost never happen. I furthermore assume two more things:

  • transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
  • The values of date_issued at INSERT time are typically "close to today", so you usually won't INSERT rows far in the past.

Based on this, to avoid aggregating the same values over and over again, I would introduce another table, let's call it transactions_aggr, built up like this:

create table transactions_aggr (
   account_id INT NOT NULL,
   date_issued DATE,
   sumamount NUMERIC,
   primary key (account_id, date_issued)
)

which will give you a table of per-day preaggregated values. To determine which values are already preaggregated, I would add another boolean-typed column to transactions, which indicates which of the rows are already contained in transactions_aggr and which are not (yet). Query no. 4 would then have to be changed so that it reads only the non-preaggregated rows from transactions, while the rest comes from transactions_aggr. To facilitate that, you could define a view like this:

select account_id, date_issued, sum(amount) as sumamount from
    (
    select account_id, date_issued, sumamount as amount from transactions_aggr
    union all
    select account_id, date_issued, amount from jrn.transactions as t where t.aggregated = false
    ) as u
group by account_id, date_issued
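
If you want to hide this behind a single relation, you could wrap it in a view; the name transactions_sums below is just a placeholder, and query no. 4 would then read from it instead of the base table:

CREATE VIEW jrn.transactions_sums AS
SELECT account_id, date_issued, SUM(amount) AS sumamount
FROM (
    SELECT account_id, date_issued, sumamount AS amount FROM transactions_aggr
    UNION ALL
    SELECT account_id, date_issued, amount FROM jrn.transactions WHERE aggregated = false
) AS u
GROUP BY account_id, date_issued;

-- Query no. 4, rewritten against the view:
SELECT SUM(sumamount), MIN(date_issued), MAX(date_issued)
FROM jrn.transactions_sums
WHERE account_id = 3748734 AND date_issued >= '2017-01-01';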

Needless to say, putting an index on transactions.aggregated (perhaps in conjunction with account_id) could greatly improve performance here.
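
One possible shape for that flag and index; the column name and the partial-index variant are only a suggestion, and note that on 9.6 adding a column with a non-NULL default rewrites the whole 120M-row table, so it needs a maintenance window:

-- Flag marking rows that are already rolled up into transactions_aggr.
ALTER TABLE jrn.transactions
    ADD COLUMN aggregated boolean NOT NULL DEFAULT false;

-- Partial index covering only the (few) not-yet-aggregated rows,
-- which keeps it small and cheap to maintain.
CREATE INDEX ON jrn.transactions (account_id, date_issued)
    WHERE aggregated = false;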

Updating transactions_aggr can be done using multiple approaches:

  1. You could treat this as a one-time activity and only pre-aggregate the current set of ~120M rows once. This would at least significantly reduce the aggregation load on your machine. However, over time you will run into the same problem again. Then you may simply re-execute the entire procedure, dropping transactions_aggr as a whole and re-creating it from scratch (all the original data is still there in transactions).

  2. You have a quiet period somewhere during the week/month/night when few or no queries are coming in. Then you can open a transaction, read all transactions WHERE aggregated = false and add them to transactions_aggr. Keep in mind to then toggle aggregated to true (this should be done in the same transaction). The tricky part, however, is that you must pay attention to what reading queries will "see" of this transaction: depending on your accuracy requirements during the timeframe of this "update job", you may have to consider raising the transaction isolation level (e.g. to REPEATABLE READ) to prevent phantom reads. A rough sketch of such a job follows below the list.
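
Here is that sketch, assuming the aggregated flag from above and using a set-based INSERT ... ON CONFLICT (available since 9.5) rather than row-by-row UPDATEs:

-- REPEATABLE READ makes sure the UPDATE below only flags rows
-- that the aggregating SELECT actually saw.
BEGIN ISOLATION LEVEL REPEATABLE READ;

INSERT INTO transactions_aggr (account_id, date_issued, sumamount)
SELECT account_id, date_issued, SUM(amount)
FROM jrn.transactions
WHERE aggregated = false
GROUP BY account_id, date_issued
ON CONFLICT (account_id, date_issued)
DO UPDATE SET sumamount = transactions_aggr.sumamount + EXCLUDED.sumamount;

UPDATE jrn.transactions
SET aggregated = true
WHERE aggregated = false;

COMMIT;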

On the matter of your query no. 3, you could then try the approach of partitioning based on type_id. However, I find your query a little strange, as you are performing a LIMIT/OFFSET without having specified an ordering (there is no ORDER BY clause in place; NB: you are not saying that you are using database cursors). This may mean that the implicit order you currently get changes once you enable partitioning on the table, so be careful about the side effects this may cause in your program. One more thing: before really doing the partition split, I would first check the data distribution with respect to type_id by issuing

select type_id, count(*) from transactions group by type_id

Otherwise it may turn out that, for example, 90% of your data is card_payment, so that you end up with a heavily uneven distribution among your partitions and the biggest performance-hogging queries are those that still go to this single "large partition".

Hope this helps a little - and good luck!

1 Comment

Thanks for the effort! I ran that query. The data distribution is almost equal, ca. 5 to 10% per type. One type is used very little, and two types are used more frequently, but not more than 20%. The bad thing about transactions_aggr is that the query is generated by the application layer and, for now, any modification of the application is problematic and not in our hands. About query order: we use the implicit order because we list transactions as they come into the system, so the natural primary key order works fine.
