
There is a table with this structure:

    Table "public.all_emails"
│ Column | Type | Modifiers
│ ----------- + -------- + -----------
│ email | text |
│ frequency | bigint |
│Indexes:
│ "all_emails_email_idx" UNIQUE, btree (email)

I want to move all records from this table into another database, doing some additional processing on them along the way. To speed this up, I wrote a multi-process application in which each process repeatedly fetches a specific slice of the table. So that each process knows where its slice begins, I sort the table and page through it like this:

    SELECT email FROM all_emails ORDER BY email LIMIT #{PULL_SIZE} OFFSET #{offset}
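
For context, each process currently does roughly the following (a simplified sketch; Python/psycopg2, the PULL_SIZE value, the worker numbering and the connection string are only illustrative):

    # Simplified, illustrative sketch of one worker process.
    import psycopg2

    PULL_SIZE = 10000  # rows per slice (placeholder value)

    def fetch_slice(worker_id):
        """Fetch the slice of all_emails assigned to this worker via LIMIT/OFFSET."""
        offset = worker_id * PULL_SIZE
        with psycopg2.connect("dbname=source") as conn:   # connection string is a placeholder
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT email FROM all_emails ORDER BY email LIMIT %s OFFSET %s",
                    (PULL_SIZE, offset),
                )
                return cur.fetchall()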

With a large number of records in the table, this LIMIT/OFFSET query becomes quite expensive and far from optimal. How can I do this better?

2 Answers


You can CLUSTER your table for this purpose:

CLUSTER all_emails USING all_emails_email_idx;
ANALYZE all_emails;

Clustering physically re-orders the rows in the table according to the specified index. The email addresses are then stored in sorted order, so your query - which is processed like any other query - finds all rows of the requested subset on a limited number of on-disk pages, which reduces I/O; the sorting effort is reduced as well, because the query planner recognizes that the table is clustered on that index. The ANALYZE command updates the table statistics after the clustering, helping the query planner make optimal choices.

This really only works for a table that is read-only, or that is updated or receives new rows only infrequently, because the clustering is not maintained: it is a one-off operation. Clustering is also a fairly "expensive" process, because the entire table is rewritten and an exclusive table lock is held while that happens. You can periodically re-cluster the table on the same index with the abbreviated form CLUSTER all_emails.
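
If you want to automate that periodic re-clustering, a minimal sketch of such a maintenance job, assuming Python with psycopg2 (the connection string is a placeholder), could look like this:

    # Minimal sketch of a periodic maintenance job; psycopg2 and the connection
    # string are assumptions, the statements are the ones shown above.
    import psycopg2

    conn = psycopg2.connect("dbname=source")
    conn.autocommit = True                   # run each statement in its own transaction
    with conn.cursor() as cur:
        cur.execute("CLUSTER all_emails")    # re-cluster on the previously chosen index
        cur.execute("ANALYZE all_emails")    # refresh statistics for the planner
    conn.close()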


Nothing will be faster than a single sequential scan for reading a whole table, at least until PostgreSQL 9.6, where parallel sequential scans will be introduced.

It would be tempting to split the table by ctid, the physical location of the tuple in the table, but PostgreSQL doesn't optimise access by ctid for operators other than =:

test=> EXPLAIN SELECT * FROM large WHERE ctid BETWEEN '(390, 0)' AND '(400,0)';
┌───────────────────────────────────────────────────────────────────┐
│                            QUERY PLAN                             │
├───────────────────────────────────────────────────────────────────┤
│ Seq Scan on large  (cost=0.00..1943.00 rows=500 width=8)          │
│   Filter: ((ctid >= '(390,0)'::tid) AND (ctid <= '(400,0)'::tid)) │
└───────────────────────────────────────────────────────────────────┘
(2 rows)

The same holds for inserts: Without being able to show numbers, I'm pretty sure that one process INSERTing or COPYing into one table will not be slower than several processes all loading data into the same table.

Since it seems that the bottleneck is the processing of the rows between the SELECT at the origin and the INSERT at the destination, I'd suggest the following (a rough sketch follows the list):

  1. Have one thread that performs a single SELECT * FROM all_emails.

  2. Create a number of threads that can perform the expensive processing in parallel.

  3. The first thread distributes the result rows to the parallel workers in a round robin fashion.

  4. Yet another thread collects the results of the parallel workers and composes them into input for a COPY tablename FROM STDIN statement that it executes.
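
A rough sketch of that structure, assuming Python with psycopg2 (the connection strings, the target table name and the expensive_processing() stub are placeholders for whatever your real processing does):

    # Rough sketch of the reader -> workers -> COPY writer pipeline described above.
    # psycopg2, the connection strings, target_table and expensive_processing()
    # are assumptions / placeholders.
    import io
    import queue
    import threading

    import psycopg2

    N_WORKERS = 4
    SENTINEL = object()            # marks "no more input" on a queue

    def expensive_processing(row):
        # Placeholder for the real per-row work; must return one line of
        # COPY text-format input (tabs, newlines and backslashes escaped).
        email, frequency = row
        return f"{email}\t{frequency}\n"

    def reader(work_queues):
        """One thread reads the whole table once and deals rows out round robin."""
        with psycopg2.connect("dbname=source") as conn:
            with conn.cursor(name="all_emails_cur") as cur:   # server-side cursor
                cur.execute("SELECT email, frequency FROM all_emails")
                for i, row in enumerate(cur):
                    work_queues[i % N_WORKERS].put(row)
        for q in work_queues:
            q.put(SENTINEL)

    def worker(work_queue, result_queue):
        """Several threads do the expensive per-row processing in parallel."""
        while True:
            row = work_queue.get()
            if row is SENTINEL:
                result_queue.put(SENTINEL)
                return
            result_queue.put(expensive_processing(row))

    def writer(result_queue):
        """One thread feeds the processed rows to COPY ... FROM STDIN in batches."""
        with psycopg2.connect("dbname=target") as conn:        # commits on exit
            with conn.cursor() as cur:
                finished, batch = 0, []
                while finished < N_WORKERS:
                    item = result_queue.get()
                    if item is SENTINEL:
                        finished += 1
                        continue
                    batch.append(item)
                    if len(batch) >= 10000:
                        cur.copy_expert("COPY target_table FROM STDIN",
                                        io.StringIO("".join(batch)))
                        batch = []
                if batch:
                    cur.copy_expert("COPY target_table FROM STDIN",
                                    io.StringIO("".join(batch)))

    work_queues = [queue.Queue(maxsize=1000) for _ in range(N_WORKERS)]
    result_queue = queue.Queue(maxsize=1000)
    threads = [threading.Thread(target=reader, args=(work_queues,)),
               threading.Thread(target=writer, args=(result_queue,))]
    threads += [threading.Thread(target=worker, args=(wq, result_queue)) for wq in work_queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Note that with CPython, threads only parallelize the expensive step if it releases the GIL (I/O, C extensions); for pure-Python CPU-bound work you would use processes instead of threads, but the structure stays the same.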

3 Comments

How does this address the issue that the OP wants to read the rows in batches sorted by email address?
He/she doesn't want that. The ORDER BY is just there to give meaning to the OFFSET/LIMIT combination. The goal is to move a table from one database to another as fast as possible, and the question was whether there is a way to make it faster with parallelization.
@LaurenzAlbe, maybe I didn't explain myself clearly. My script has several parallel processes, each of which takes a piece of the table, performs some operations on the rows, and inserts them into another table. The problem is that the query needed to select each process's piece is not optimal.
