
My question relates to the innards of how Postgres works:

I have a table:


CREATE TABLE A (
   id SERIAL,
   name VARCHAR(32),
   type VARCHAR(32) NOT NULL,
   priority SMALLINT NOT NULL,
   x SMALLINT NOT NULL,
   y SMALLINT NOT NULL,
   start timestamp with time zone,
   "end" timestamp with time zone,   -- "end" is a reserved word, so it has to be quoted
   state Astate NOT NULL,            -- Astate is an enumerated type (see below)
   other_table_id1 bigint REFERENCES W,
   other_table_id2 bigint NOT NULL REFERENCES S,
   PRIMARY KEY(id)
);

with additional indexes on other_table_id1, state and other_table_id2.
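For concreteness, those indexes are roughly the following (the index names here are just placeholders for illustration):

CREATE INDEX a_other_table_id1_idx ON A (other_table_id1);
CREATE INDEX a_other_table_id2_idx ON A (other_table_id2);
CREATE INDEX a_state_idx ON A (state);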

The table is quite large and sees very frequent updates to the other_table_id1 and state columns, a few updates to the start and end columns, and no updates at all to the rest. (Astate is an enumerated type used for the state column.)

I'm wondering whether it makes sense to split the two most frequently updated columns out into a separate table. What I'm hoping to gain is performance: lookups that only need that info would touch a shorter row, and (maybe?) reading and writing the shorter row would make the updates themselves cheaper. But I need to weigh that against the cost of the joins that would occasionally be needed to get all the data for a particular item at once.

At one point I was under the impression that each column is stored separately. I later revised that view when I read somewhere that narrowing a column on one side of a table improves performance even for lookups on another column, because the whole row is stored together, so the overall row gets shorter. So I'm now under the impression that all the data for a row is physically stored together on disk, which makes the proposed split sound helpful. When I currently write 4 bytes to update the state, am I to believe I'm also rewriting the 64 bytes of text (name, type) that actually never change?
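Concretely, the split I have in mind would look something like this (the table names are just for illustration; the real change would of course also need data migration and query rewrites):

-- the rarely-changing columns stay together...
CREATE TABLE a_static (
   id SERIAL PRIMARY KEY,
   name VARCHAR(32),
   type VARCHAR(32) NOT NULL,
   priority SMALLINT NOT NULL,
   x SMALLINT NOT NULL,
   y SMALLINT NOT NULL,
   start timestamp with time zone,
   "end" timestamp with time zone,
   other_table_id2 bigint NOT NULL REFERENCES S
);

-- ...and the two hot columns move to a narrow 1:1 companion table
CREATE TABLE a_hot (
   id integer PRIMARY KEY REFERENCES a_static,
   state Astate NOT NULL,
   other_table_id1 bigint REFERENCES W
);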

I'm not very experienced with table "normalization" and not familiar with the internals of Postgres, so I'm looking for advice, and especially for best practices for estimating the tradeoff without having to do the work first and only then find out whether it was worthwhile. The change would require a fair bit of effort rewriting queries that have already been highly optimized, so I would rather go in with a good understanding of what result to expect. Thanks, M.

3 Answers


There is a definite cost to updating a larger row.

A formula can help with this. If you do not split, your costs are

Cost = xU + yS

where:

U = cost of an update of the entire row (table is not split)

S = cost of a select

x, y = how many updates and selects occur, respectively

Then, if you split it, the cost you are weighing against that is:

Cost = gU1 + hU2 + xS1 + yS2

where

U1 = update of smaller table (lower cost than U)

U2 = update of larger table (also lower than U, but higher than U1)

S1 = select from smaller table

S2 = select from larger table

g,h,x,y = how often the individual actions occur

So if g >> h, it pays to break them up; and if x >> y as well, it really pays.
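To make that concrete with invented numbers: say a full-row update costs 10 units (U), an update of the small table 2 (U1), an update of the large table 9 (U2), a full-row select 5 (S), a select from the small table 1 (S1) and from the large table 4 (S2). With 1,000 hot-column updates (g), 10 other updates (h), 1,000 selects needing only the hot columns (x) and 10 selects hitting the large table (y), the unsplit cost is 1,010 × 10 + 1,010 × 5 = 15,150, while the split cost is 1,000 × 2 + 10 × 9 + 1,000 × 1 + 10 × 4 = 3,130. The unit costs are made up; the point is how the split wins when g >> h and x >> y.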

EDIT: In response to comments, I would also point out that these costs become far more important if the database is under sustained load, with no idle periods. If instead the server does not experience sustained load and is mostly inactive, handling just 1 or 2 transactions per second with long stretches (where "long" means a few seconds) of inactivity, then, if it were me, I would not complicate my code, because the performance benefit would never show up as anything measurable.


4 Comments

Thanks Ken, this is in the direction I'm trying to go. Would you say that U1 and U2 (and U) are all proportional to the width of the columns? Would you say that (U1 + U2) == U within some margin of overhead (U1 + U2 < 1.2 * U)?
And since there's a 1:1 relationship between the rows in the two tables, would you say that a join costs (S1 + S2)?
@Mayur, some years ago I did heavy write testing and found that U1 and U2 are definitely proportional to row size under sustained load. The key point is sustained load: if your updates are 'frequent' but 'frequent' means once every 20 seconds, and the database is often inactive for seconds at a time, then there is probably nothing to be gained by splitting the table.
@Mayur, to a first approximation, yes, a join costs S1 + S2. It is muddied by the fact that we have indexes, caching, and so forth, but the number of operations is the foundation for calculating the costs.

One of PostgreSQL's implementation details that has a bearing on this is that it never updates rows in place on disk: it always writes a new version of the row. So there is no quick win from putting fixed-width columns together at the start of the row, as there is with Oracle, for example (iirc).

It is true that grouping columns into different tables based on whether they tend to be updated together can produce less garbage for VACUUM to clean up. Experimenting and measuring the results is key here. If some of the data is updated frequently, you should investigate the "fillfactor" setting on the table: it makes PostgreSQL leave some free space in each table page when inserting, so that updated versions of a row can often be added to the same page as the previous version. That can reduce the burden of an update, because the indexes pointing at the row may not have to be updated, at the expense of the table taking up more space on disk overall.
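For example, something like this (70 is just an illustrative value; the right figure depends on your row size and update rate, and it only applies to pages filled after the setting is changed):

ALTER TABLE A SET (fillfactor = 70);  -- leave ~30% of each page free for updated row versions
-- existing pages are only repacked when the table is rewritten, e.g.:
VACUUM FULL A;                        -- takes an exclusive lock; CLUSTER is an alternative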

As mentioned by Xaade, there is a lot of material to get into on this subject. I'd like to reinforce my comment about needing to measure the impact of any changes made: sometimes what looks like a big win turns out not to be one in practice.
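One cheap way to see what your current update pattern looks like before committing to anything is the statistics collector; for example (the view and columns are standard, only the table name is yours):

SELECT relname,
       n_tup_upd,       -- total row updates
       n_tup_hot_upd,   -- "heap-only" updates that avoided touching the indexes
       n_dead_tup       -- dead row versions waiting to be vacuumed
  FROM pg_stat_user_tables
 WHERE relname = 'a';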

Comments


It's worth splitting it up no matter how the columns are stored. You'd run into far fewer concurrency problems, speed up lookups that only need part of the data, speed up index searches by giving yourself three indexes to search with without having to create those secondary keys, and so forth.

You can lessen the impact of the inner joins either by cheating or by only allowing so many rows to be viewed at one time. You can cheat by providing an interface rather than allowing direct lookups: only join in the extra data for the rows that are actually visible (you can only fit so many rows on the screen at once), or show the additional data only for the currently selected row, or allow only X rows per lookup with browse buttons. If you use the cheat, make sure you cache the results of the extended lookup.
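A sketch of the "only X rows per lookup" idea, using the a_static / a_hot names from the sketch in the question and an arbitrary page size of 50:

-- join the two halves only for the page of rows actually being displayed
SELECT s.id, s.name, s.type, h.state
  FROM a_static s
  JOIN a_hot h USING (id)
 ORDER BY s.id
 LIMIT 50 OFFSET 0;   -- simple paging; keyset pagination (WHERE s.id > last_seen_id) scales better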

2 Comments

Thanks. The suggestion about "cheating" isn't really relevant to my application because humans rarely sort through these tables. It's all for the machine, so the extra columns already aren't selected unless they really need to be seen (i.e. SELECT id, state, other_table_id2 WHERE ...). Is there a way to estimate the increased cost of the join, or the savings from the reduced table size?
Look into Database Management. It's a whole career field, because it takes a lot of analysis to determine these things. You have to estimate the opportunity cost not only of your solution but of every alternative, and you have to consider when data gets updated: splitting up tables may allow for fewer collisions. You have to work through every possible combination to find the most efficient solution. However, when working with large amounts of data where the INNER JOIN would be only occasional and parts of the table can be updated independently, it's always beneficial to split up.
