Nothing will be faster than a single sequential scan for reading a whole table, at least not until PostgreSQL 9.6, which will introduce parallel sequential scans.
It would be tempting to split the table by ctid, the physical location of a tuple in the table, but PostgreSQL doesn't optimise access by ctid for any operator other than = (only an equality comparison can use a TID scan), so a range condition still ends up as a sequential scan with a filter:
test=> EXPLAIN SELECT * FROM large WHERE ctid BETWEEN '(390, 0)' AND '(400,0)';
┌────────────────────────────────────────────────────────────────────┐
│                             QUERY PLAN                             │
├────────────────────────────────────────────────────────────────────┤
│ Seq Scan on large  (cost=0.00..1943.00 rows=500 width=8)           │
│    Filter: ((ctid >= '(390,0)'::tid) AND (ctid <= '(400,0)'::tid)) │
└────────────────────────────────────────────────────────────────────┘
(2 rows)
The same holds for inserts: while I cannot show numbers, I'm pretty sure that a single process INSERTing or COPYing into a table will not be slower than several processes all loading data into the same table.
Since the bottleneck seems to be the processing of the rows between the SELECT at the origin and the INSERT at the destination, I'd suggest the following (a code sketch follows the list):
- Have one thread that performs a single SELECT * FROM all_emails.
- Create a number of threads that can perform the expensive processing in parallel.
- The first thread distributes the result rows to the parallel workers in a round-robin fashion.
- Yet another thread collects the results of the parallel workers and composes them into input for a COPY tablename FROM STDIN statement that it executes.
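Roughly like this, as a minimal sketch in Python with psycopg2, keeping the source table all_emails and the target table name "tablename" from above. The connection strings, the batch size, the queue sizes and process_row() are placeholders for your environment, and because CPython threads don't run Python code in parallel (GIL), truly CPU-bound processing would need multiprocessing or a processing function that releases the GIL.

import io
import queue
import threading

import psycopg2

SOURCE_DSN = "dbname=source"   # placeholder connection strings
TARGET_DSN = "dbname=target"
N_WORKERS = 4                  # number of parallel processing threads
BATCH_ROWS = 10000             # rows per COPY batch
SENTINEL = None                # marks "no more rows"

def process_row(row):
    # placeholder for the expensive per-row processing
    return row

def reader(work_queues):
    # one thread, one SELECT; a named (server-side) cursor streams the rows
    conn = psycopg2.connect(SOURCE_DSN)
    try:
        with conn, conn.cursor(name="reader") as cur:
            cur.execute("SELECT * FROM all_emails")
            for i, row in enumerate(cur):
                # hand the rows to the workers round robin
                work_queues[i % len(work_queues)].put(row)
    finally:
        conn.close()
        for q in work_queues:          # tell every worker we are done
            q.put(SENTINEL)

def worker(work_queue, out_queue):
    # the expensive processing happens here, in parallel
    while True:
        row = work_queue.get()
        if row is SENTINEL:
            out_queue.put(SENTINEL)
            return
        out_queue.put(process_row(row))

def writer(out_queue):
    # collect the processed rows and feed them to COPY ... FROM STDIN
    conn = psycopg2.connect(TARGET_DSN)
    try:
        with conn, conn.cursor() as cur:
            finished = 0
            buf, buffered = io.StringIO(), 0
            while finished < N_WORKERS:
                row = out_queue.get()
                if row is SENTINEL:
                    finished += 1
                    continue
                # text COPY format: tab-separated columns, one row per line
                # (no NULL or escape handling in this sketch)
                buf.write("\t".join(map(str, row)) + "\n")
                buffered += 1
                if buffered >= BATCH_ROWS:
                    buf.seek(0)
                    cur.copy_expert("COPY tablename FROM STDIN", buf)
                    buf, buffered = io.StringIO(), 0
            if buffered:
                buf.seek(0)
                cur.copy_expert("COPY tablename FROM STDIN", buf)
    finally:
        conn.close()

def main():
    work_queues = [queue.Queue(maxsize=1000) for _ in range(N_WORKERS)]
    out_queue = queue.Queue(maxsize=1000)
    threads = [threading.Thread(target=reader, args=(work_queues,)),
               threading.Thread(target=writer, args=(out_queue,))]
    threads += [threading.Thread(target=worker, args=(q, out_queue))
                for q in work_queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    main()

Flushing the COPY every BATCH_ROWS rows keeps memory bounded while still giving COPY reasonably large chunks to work with; tune that value, the queue sizes and N_WORKERS to the real row width and processing cost.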