
Is there an especially efficient way to bulk-update a simple boolean field in Postgres for a large number of records?

I have a table containing millions of rows, and occasionally I want to set fresh=false on a large but well-indexed subset of those rows.

However, if I try to do the obvious:

UPDATE mytable SET fresh=false WHERE mycriteria;

it runs for hours, consumes all available memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged.

Instead, I've written a bash script to run this update in mini-chunks of a few thousand records at a time, which still takes hours, but at least gets the job done and gives me progress information to boot. Is there a better way?
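For reference, each chunk of that script boils down to something like this (a sketch; it assumes mytable has a primary key id and that mycriteria stands in for the real predicate, and the script repeats it until UPDATE 0 is reported):

-- Update one batch of matching rows; rerun until no rows are updated
UPDATE mytable
SET    fresh = false
WHERE  id IN (
    SELECT id
    FROM   mytable
    WHERE  fresh = true
    AND    mycriteria
    LIMIT  5000
);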

  • UPDATE mytable SET fresh=false WHERE fresh=true AND mycriteria; Commented May 21, 2014 at 17:58
  • @wildplasser, Yes, that's included in mycriteria. Even with that, I still have the problem. Commented May 21, 2014 at 18:15
  • Have you allocated enough memory (shared buffers) to Postgres? If yes, I do not think there is much you can do. For my developer machine and huge tables, I usually have to do a copy with the extra value and then DROP and recreate the table with the added field. Commented May 21, 2014 at 19:29
  • 1
    What does EXPLAIN (VERBOSE) UPDATE mytable SET fresh=false WHERE mycriteria report? What version of PostgreSQL and what OS? It is not unusual for such an update to be slow and to use a lot of disk space, but it should not exhaust RAM unless there is a bug or your settings are way off. Do you have constraints on the table and are they deferred? Commented May 21, 2014 at 19:54

1 Answer


It runs for hours, consumes all available memory, and starts to swap, rendering my machine nearly unusable and forcing me to kill the process, which leaves the data unchanged.

Based on that description, you probably have AFTER UPDATE ... FOR EACH ROW triggers defined on the table.

At present, PostgreSQL (in 9.4 and prior, at least) uses an in-memory queue for pending AFTER trigger events. It's an efficient queue, but it's held entirely in memory, and after a few million rows that really starts to add up.

To confirm that this is the case, attach gdb to the postgres backend doing the work once its memory use grows, using gdb -p the-big-postgres-process-id, e.g. gdb -p 1234 if 1234 is the pid of the postgres process that shows up as using lots of RAM in top. Alternatively, run SELECT pg_backend_pid() in your session before starting the UPDATE.
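For example (a sketch; pg_stat_activity's query column assumes PostgreSQL 9.2 or newer):

SELECT pg_backend_pid();   -- run in the same session, before starting the UPDATE

-- or, from another session, locate the backend running the UPDATE:
SELECT pid, query
FROM   pg_stat_activity
WHERE  query LIKE 'UPDATE mytable%';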

Either way, once you've got gdb attached and you're at the (gdb) prompt run:

(gdb) p MemoryContextStats(TopMemoryContext)
(gdb) detach
(gdb) quit

If gdb complains about missing symbols, you may have to install a debuginfo package first; see the instructions on the wiki.

This will confirm where the memory is really going.

If this does turn out to be AFTER UPDATE ... FOR EACH ROW triggers, your options are:

  • Use a FOR EACH STATEMENT trigger instead; note that there is no way to get the NEW and OLD rows in that case (see the sketch after this list);

  • Use a BEFORE trigger; or

  • Sponsor development of spill-to-disk storage for the AFTER trigger queue ;-)
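A minimal sketch of the first option, using a placeholder function name; on 9.4 there are no transition tables, so the statement-level function cannot see the individual changed rows:

CREATE OR REPLACE FUNCTION mytable_after_update_stmt() RETURNS trigger AS $$
BEGIN
    -- Statement-level work goes here; individual NEW/OLD rows are not available.
    RAISE NOTICE 'mytable was updated';
    RETURN NULL;   -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_after_update
    AFTER UPDATE ON mytable
    FOR EACH STATEMENT
    EXECUTE PROCEDURE mytable_after_update_stmt();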

BTW, one thing to keep in mind: if you have a 100-column-wide table and you update one field, every column still has to be copied into the new row version because of MVCC. The exception is TOASTed columns stored out-of-line (large text fields, arrays, bytea fields, etc.); if they are not modified they don't have to be copied. So a "trivial" update may not be as trivial as you think.
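A rough way to see that cost (a sketch, not your exact table): compare the heap size before and after a whole-table update; until VACUUM reclaims the dead row versions, the relation roughly doubles.

SELECT pg_size_pretty(pg_relation_size('mytable'));  -- before
UPDATE mytable SET fresh = false;                     -- writes a new version of every row
SELECT pg_size_pretty(pg_relation_size('mytable'));  -- after: roughly twice the size until VACUUM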
