I used to update a very large table with UPDATE queries, but they were taking too long to execute. To improve performance, I switched to rebuilding the table with CREATE TABLE AS and then re-adding the indexes. This has cut the execution time significantly, but I want to understand the approach's scalability and limitations.
Server Specifications:
- PostgreSQL version: 15.6
- RAM: 32 GB
- Cores: 16
- Disk Space: SSD 250 GB (50% free)
- OS: Linux Ubuntu 22.04
PostgreSQL Configuration:
max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
huge_pages = try
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 16
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
Table Details:
| Table Name | Row Count | Size |
|---|---|---|
| source_switchdata_tmp_details | 60 Million | 30 GB |
| source_npcidata_tmp_details | 60 Million | 30 GB |
| source_aepscbsdata_tmp_details | 60 Million | 30 GB |
Query:
BEGIN;
ALTER TABLE source_switchdata_tmp_details RENAME TO source_switchdata_tmp_details_og;
CREATE TABLE source_switchdata_tmp_details AS
SELECT DISTINCT ON (A.uniqueid) A.transactiondate,
A.cycles,
A.transactionamount,
A.bcid,
A.bcname,
A.username,
A.terminalid,
A.uidauthcode,
A.itc,
A.transactiondetails,
A.deststan,
A.sourcestan,
A.hostresponsecode,
A.institutionid,
A.acquirer,
A.bcrefid,
A.cardno,
A.rrn,
A.transactiontype,
A.filename,
A.cardnotrim,
A.uniqueid,
A.transactiondatetime,
A.transactionstatus,
A.overall_probable_status,
A.recon_created_date,
A.priority_no,
A.recon_key_priority_1_1_to_2,
A.recon_key_priority_1_1_to_3,
A.recon_key_priority_2_1_to_2,
A.recon_key_priority_2_1_to_3,
A.process_status,
A.reconciliation_date_time,
CURRENT_TIMESTAMP AS recon_updated_date,
CASE
WHEN C.recon_key_priority_1_2_to_1 IS NOT NULL THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS recon_status_1_to_2,
CASE
WHEN D.recon_key_priority_1_3_to_1 IS NOT NULL THEN 'Reconciled'
WHEN D.recon_key_priority_2_3_to_1 IS NOT NULL THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS recon_status_1_to_3,
CASE
WHEN (C.recon_key_priority_1_2_to_1 IS NOT NULL AND D.recon_key_priority_1_3_to_1 IS NOT NULL) THEN 'Reconciled'
WHEN (D.recon_key_priority_2_3_to_1 IS NOT NULL) THEN 'Reconciled'
ELSE 'Not Reconciled'
END AS overall_recon_status
FROM source_switchdata_tmp_details_og A
LEFT JOIN source_aepscbsdata_tmp_details C ON (A.recon_key_priority_1_1_to_2 = C.recon_key_priority_1_2_to_1)
LEFT JOIN source_npcidata_tmp_details D
ON (A.recon_key_priority_1_1_to_3 = D.recon_key_priority_1_3_to_1)
OR (A.recon_key_priority_2_1_to_3 = D.recon_key_priority_2_3_to_1);
DROP TABLE source_switchdata_tmp_details_og;
COMMIT;
Unique Constraints and Indexes:
- A.uniqueid: primary key (indexed)
- A.recon_key_priority_1_1_to_3: index
- A.recon_key_priority_1_1_to_2: index
- D.recon_key_priority_1_3_to_1: index
- A.recon_key_priority_2_1_to_3: index
- D.recon_key_priority_2_3_to_1: index
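Because CREATE TABLE AS carries over neither the primary key nor any of these indexes, the "adding indexes" step mentioned above means re-creating them on the rebuilt table after the swap. A minimal sketch of that step, with illustrative names (only the columns match my actual schema):

-- Re-add the primary key and the join-key indexes on the rebuilt table.
-- Constraint/index names below are placeholders.
ALTER TABLE source_switchdata_tmp_details
    ADD CONSTRAINT source_switchdata_tmp_details_pkey PRIMARY KEY (uniqueid);

CREATE INDEX idx_switch_recon_key_priority_1_1_to_2
    ON source_switchdata_tmp_details (recon_key_priority_1_1_to_2);

CREATE INDEX idx_switch_recon_key_priority_1_1_to_3
    ON source_switchdata_tmp_details (recon_key_priority_1_1_to_3);

CREATE INDEX idx_switch_recon_key_priority_2_1_to_3
    ON source_switchdata_tmp_details (recon_key_priority_2_1_to_3);

These builds happen after the data load, so they can take advantage of the maintenance_work_mem and max_parallel_maintenance_workers settings listed above.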
Questions:
- Currently, I run this query over 180 million rows (60M + 60M + 60M). In the future, I may need to run it over 1 billion rows; we can increase the server specifications if needed. Is recreating the table still practical and scalable at 300 million or even 1 billion rows?
- My team suggests updating the data in chunks of 1 million rows instead (see the sketch after this list). Is that approach better than the current one?
- The query currently takes around 20 minutes, which is acceptable. As the data grows, which bottlenecks (I/O in particular) should I watch for so that the runtime scales roughly in proportion to the data size instead of stalling?
- What are the limitations of the current approach, and how can I work around them?
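For reference, the chunked approach my team has in mind would look roughly like the sketch below. It is an illustration of the idea rather than code we run: the 1M batch size, the assumption that uniqueid is a sortable text key, and deriving only recon_status_1_to_2 from one of the two joins are all simplifications (the status columns are assumed to already exist on the table).

-- Hypothetical chunked-update sketch; run outside an explicit transaction
-- block so that COMMIT inside the DO body is allowed (PostgreSQL 11+).
DO $$
DECLARE
    batch_size CONSTANT integer := 1000000;  -- suggested chunk size
    last_id    text := '';                   -- assumes uniqueid sorts as text
    max_id     text;
BEGIN
    LOOP
        -- Upper bound of the next chunk, walking the primary key in order.
        SELECT max(uniqueid) INTO max_id
        FROM (SELECT uniqueid
              FROM source_switchdata_tmp_details
              WHERE uniqueid > last_id
              ORDER BY uniqueid
              LIMIT batch_size) t;

        EXIT WHEN max_id IS NULL;  -- no rows left to process

        -- Simplified: only the 1-to-2 status from the AEPS/CBS join is set here.
        -- The full version would also join source_npcidata_tmp_details, set
        -- recon_status_1_to_3 and overall_recon_status, and handle duplicate
        -- join matches the way DISTINCT ON does in the rebuild query.
        UPDATE source_switchdata_tmp_details A
        SET recon_status_1_to_2 = CASE
                WHEN C.recon_key_priority_1_2_to_1 IS NOT NULL THEN 'Reconciled'
                ELSE 'Not Reconciled'
            END,
            recon_updated_date = CURRENT_TIMESTAMP
        FROM source_switchdata_tmp_details A2
        LEFT JOIN source_aepscbsdata_tmp_details C
               ON A2.recon_key_priority_1_1_to_2 = C.recon_key_priority_1_2_to_1
        WHERE A.uniqueid = A2.uniqueid
          AND A2.uniqueid > last_id
          AND A2.uniqueid <= max_id;

        last_id := max_id;
        COMMIT;  -- keep each chunk's transaction small
    END LOOP;
END $$;

The key design point of the sketch is that it walks the primary key and commits after every chunk, so no single transaction touches more than batch_size rows.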
Any insights or optimizations would be greatly appreciated. Thank you!