I used to update a very large table with UPDATE queries, but they were taking too long to execute. To improve performance, I switched to rebuilding the table with CREATE TABLE AS and then re-adding the indexes. This has cut the execution time significantly, but I want to understand the approach's scalability and limitations.
Server Specifications:
- PostgreSQL version: 15.6
- RAM: 32 GB
- Cores: 16
- Disk Space: SSD 250 GB (50% free)
- OS: Linux Ubuntu 22.04
PostgreSQL Configuration:
max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
Table Details:
| Table Name | Row Count | Size |
|---|---|---|
| source_switchdata_tmp_details | 60 Million | 30 GB |
| source_npcidata_tmp_details | 60 Million | 30 GB |
| source_aepscbsdata_tmp_details | 60 Million | 30 GB |
Query:
BEGIN;
ALTER TABLE source_switchdata_tmp_details RENAME TO source_switchdata_tmp_details_og;
CREATE TABLE source_switchdata_tmp_details AS
SELECT DISTINCT ON (A.uniqueid) A.transactiondate,
A.cycles,
A.transactionamount,
A.bcid,
A.bcname,
A.username,
A.terminalid,
A.uidauthcode,
A.itc,
A.transactiondetails,
A.deststan,
A.sourcestan,
A.hostresponsecode,
A.institutionid,
A.acquirer,
A.bcrefid,
A.cardno,
A.rrn,
A.transactiontype,
A.filename,
A.cardnotrim,
A.uniqueid,
A.transactiondatetime,
A.transactionstatus,
A.overall_probable_status,
A.recon_created_date,
A.priority_no,
A.recon_key_priority_1_1_to_2,
A.recon_key_priority_1_1_to_3,
A.recon_key_priority_2_1_to_2,
A.recon_key_priority_2_1_to_3,
A.process_status,
A.reconciliation_date_time,
CURRENT_TIMESTAMP AS recon_updated_date,
CASE
WHEN C.recon_key_priority_1_2_to_1 IS NOT NULL THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS recon_status_1_to_2,
CASE
WHEN D.recon_key_priority_1_3_to_1 IS NOT NULL THEN 'Reconciled'
WHEN D.recon_key_priority_2_3_to_1 IS NOT NULL THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS recon_status_1_to_3,
CASE
WHEN (C.recon_key_priority_1_2_to_1 IS NOT NULL AND D.recon_key_priority_1_3_to_1 IS NOT NULL) THEN 'Reconciled'
WHEN (D.recon_key_priority_2_3_to_1 IS NOT NULL) THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS overall_recon_status
FROM source_switchdata_tmp_details_og A
LEFT JOIN source_aepscbsdata_tmp_details C ON (A.recon_key_priority_1_1_to_2 = C.recon_key_priority_1_2_to_1)
LEFT JOIN source_npcidata_tmp_details D
ON (A.recon_key_priority_1_1_to_3 = D.recon_key_priority_1_3_to_1)
OR (A.recon_key_priority_2_1_to_3 = D.recon_key_priority_2_3_to_1);
DROP TABLE source_switchdata_tmp_details_og;
COMMIT;
Unique Constraints and Indexes:
- A.uniqueid: primary key (indexed)
- A.recon_key_priority_1_1_to_3: index
- A.recon_key_priority_1_1_to_2: index
- D.recon_key_priority_1_3_to_1: index
- A.recon_key_priority_2_1_to_3: index
- D.recon_key_priority_2_3_to_1: index
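Because CREATE TABLE AS carries over neither the primary key nor any of these indexes, the "adding indexes" step mentioned above means re-creating them on the rebuilt table after the swap. A minimal sketch of that step, with illustrative names (only the columns match my actual schema):

-- Re-add the primary key and the join-key indexes on the rebuilt table.
-- Constraint/index names below are placeholders.
ALTER TABLE source_switchdata_tmp_details
    ADD CONSTRAINT source_switchdata_tmp_details_pkey PRIMARY KEY (uniqueid);

CREATE INDEX idx_switch_recon_key_priority_1_1_to_2
    ON source_switchdata_tmp_details (recon_key_priority_1_1_to_2);

CREATE INDEX idx_switch_recon_key_priority_1_1_to_3
    ON source_switchdata_tmp_details (recon_key_priority_1_1_to_3);

CREATE INDEX idx_switch_recon_key_priority_2_1_to_3
    ON source_switchdata_tmp_details (recon_key_priority_2_1_to_3);

These builds happen after the data load, so they can take advantage of the maintenance_work_mem and max_parallel_maintenance_workers settings listed above.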
Questions:
- Currently, I run this query over 180 million rows (60M + 60M + 60M). In the future, I may need to run it over 1 billion rows; we can increase the server specifications if needed. Is recreating the table still practical and scalable at 300 million or even 1 billion rows?
- My team suggests updating the data in chunks of 1 million rows instead (see the sketch after this list). Is that approach better than the current one?
- The query currently takes around 20 minutes, which is acceptable. As the data grows, which bottlenecks (I/O in particular) should I watch for so that the runtime scales roughly in proportion to the data size instead of stalling?
- What are the limitations of the current approach, and how can I work around them?
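For reference, the chunked approach my team has in mind would look roughly like the sketch below. It is an illustration of the idea rather than code we run: the 1M batch size, the assumption that uniqueid is a sortable text key, and deriving only recon_status_1_to_2 from one of the two joins are all simplifications (the status columns are assumed to already exist on the table).

-- Hypothetical chunked-update sketch; run outside an explicit transaction
-- block so that COMMIT inside the DO body is allowed (PostgreSQL 11+).
DO $$
DECLARE
    batch_size CONSTANT integer := 1000000;  -- suggested chunk size
    last_id    text := '';                   -- assumes uniqueid sorts as text
    max_id     text;
BEGIN
    LOOP
        -- Upper bound of the next chunk, walking the primary key in order.
        SELECT max(uniqueid) INTO max_id
        FROM (SELECT uniqueid
              FROM source_switchdata_tmp_details
              WHERE uniqueid > last_id
              ORDER BY uniqueid
              LIMIT batch_size) t;

        EXIT WHEN max_id IS NULL;  -- no rows left to process

        -- Simplified: only the 1-to-2 status from the AEPS/CBS join is set here.
        -- The full version would also join source_npcidata_tmp_details, set
        -- recon_status_1_to_3 and overall_recon_status, and handle duplicate
        -- join matches the way DISTINCT ON does in the rebuild query.
        UPDATE source_switchdata_tmp_details A
        SET recon_status_1_to_2 = CASE
                WHEN C.recon_key_priority_1_2_to_1 IS NOT NULL THEN 'Reconciled'
                ELSE 'Not Reconciled'
            END,
            recon_updated_date = CURRENT_TIMESTAMP
        FROM source_switchdata_tmp_details A2
        LEFT JOIN source_aepscbsdata_tmp_details C
               ON A2.recon_key_priority_1_1_to_2 = C.recon_key_priority_1_2_to_1
        WHERE A.uniqueid = A2.uniqueid
          AND A2.uniqueid > last_id
          AND A2.uniqueid <= max_id;

        last_id := max_id;
        COMMIT;  -- keep each chunk's transaction small
    END LOOP;
END $$;

The key design point of the sketch is that it walks the primary key and commits after every chunk, so no single transaction touches more than batch_size rows.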
Any insights or optimizations would be greatly appreciated. Thank you!