
Is there an especially efficient way to bulk-update a simple boolean field in Postgres for a large number of records?

I have a table containing millions of rows, and occasionally I want to set fresh=false on a large but well-indexed subset of those rows.

However, if I try to do the obvious:

UPDATE mytable SET fresh=false WHERE mycriteria;

it runs for hours, consumes all available memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged.

Instead, I've written a bash script to run this update in mini-chunks of a few thousand records at a time, which still takes hours, but at least gets the job done and gives me progress information to boot. Is there a better way?
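For reference, each chunk of that script boils down to something like this (a sketch; it assumes mytable has a primary key id and that mycriteria stands in for the real predicate, and the script repeats it until UPDATE 0 is reported):

-- Update one batch of matching rows; rerun until no rows are updated
UPDATE mytable
SET    fresh = false
WHERE  id IN (
    SELECT id
    FROM   mytable
    WHERE  fresh = true
    AND    mycriteria
    LIMIT  5000
);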

  • UPDATE mytable SET fresh=false WHERE fresh=true AND mycriteria; Commented May 21, 2014 at 17:58
  • @wildplasser, Yes, that's included in mycriteria. Even with that, I still have the problem. Commented May 21, 2014 at 18:15
  • Have you allocated enough memory (shared buffers) to Postgres? If yes, I do not think there is much you can do. For my developer machine and huge tables, I usually have to do a copy with the extra value and then DROP and recreate the table with the added field. Commented May 21, 2014 at 19:29
  • 1
    What does EXPLAIN (VERBOSE) UPDATE mytable SET fresh=false WHERE mycriteria report? What version of PostgreSQL and what OS? It is not unusual for such an update to be slow and to use a lot of disk space, but it should not exhaust RAM unless there is a bug or your settings are way off. Do you have constraints on the table and are they deferred? Commented May 21, 2014 at 19:54

1 Answer


It runs for hours, consumes all available memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged.

Based on that description, you probably have AFTER UPDATE ... FOR EACH ROW triggers defined on the table.

At present, PostgreSQL (in 9.4 and prior, at least) uses an in-memory queue for pending AFTER trigger events. It's an efficient queue, but it's held entirely in memory, and after a few million rows that really starts to add up.

To confirm that this is the case, attach gdb to the postgres backend doing the work once its memory use grows, using gdb -p the-big-postgres-process-id, e.g. gdb -p 1234 if 1234 is the pid of the postgres process that shows up as using lots of RAM in top. Alternatively, run SELECT pg_backend_pid() in your session before starting the UPDATE.
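For example (a sketch; pg_stat_activity's query column assumes PostgreSQL 9.2 or newer):

SELECT pg_backend_pid();   -- run in the same session, before starting the UPDATE

-- or, from another session, locate the backend running the UPDATE:
SELECT pid, query
FROM   pg_stat_activity
WHERE  query LIKE 'UPDATE mytable%';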

Either way, once you've got gdb attached and you're at the (gdb) prompt run:

(gdb) p MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit

If gdb complains about missing symbols, you may have to install a debuginfo package first; see the instructions on the wiki.

This will confirm where the memory is really going.

If this does turn out to be AFTER UPDATE ... FOR EACH ROW triggers, your options are:

  • Use a FOR EACH STATEMENT trigger instead; note that there is no way to get the NEW and OLD rows in that case (see the sketch after this list);

  • Use a BEFORE trigger; or

  • Sponsor development of spill-to-disk storage for the AFTER trigger queue ;-)
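A minimal sketch of the first option, using a placeholder function name; on 9.4 there are no transition tables, so the statement-level function cannot see the individual changed rows:

CREATE OR REPLACE FUNCTION mytable_after_update_stmt() RETURNS trigger AS $$
BEGIN
    -- Statement-level work goes here; individual NEW/OLD rows are not available.
    RAISE NOTICE 'mytable was updated';
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_after_update
    AFTER UPDATE ON mytable
    FOR EACH STATEMENT
    EXECUTE PROCEDURE mytable_after_update_stmt();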

BTW, one thing to keep in mind: if you have a 100-column-wide table and you update one field, every column still has to be copied into the new row version because of MVCC. The exception is TOASTed columns stored out-of-line (large text fields, arrays, bytea fields, etc.); if they are not modified they don't have to be copied. So a "trivial" update may not be as trivial as you think.
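A rough way to see that cost (a sketch, not your exact table): compare the heap size before and after a whole-table update; until VACUUM reclaims the dead row versions, the relation roughly doubles.

SELECT pg_size_pretty(pg_relation_size('mytable'));  -- before
UPDATE mytable SET fresh = false;                     -- writes a new version of every row
SELECT pg_size_pretty(pg_relation_size('mytable'));  -- after: roughly twice the size until VACUUM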
