
I have a table with three columns A, B, C, all of type bytea. There are around 180,000,000 rows in the table. A, B, and C each contain exactly 20 bytes of data; C sometimes contains NULLs.

When creating indexes for all columns with

CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);

index_A is created in around 10 minutes, while the indexes on B and C were still running after more than 10 hours, at which point I aborted them. I ran each CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database. When running

SELECT * FROM pg_stat_activity;

wait_event_type and wait_event are both NULL, and state is active.

Why are the second and third index creations taking so long, and can I do anything to speed them up?

1 Answer


Ensure the statistics on your table are up-to-date.
Then execute the following query:

SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = '<Your table name here>';
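
If the statistics might be stale, refreshing them before running that query is a one-liner (a sketch, assuming the table is named transactions as in the question):

-- refresh planner statistics so pg_stats reflects the current data
ANALYZE transactions;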

Basically, the database has more work to do when creating an index if:

  • The number of distinct values is higher.
  • The correlation (i.e. how closely the physical order of rows on disk matches the order of the values in the column) is close to 0.

I suspect you will see that field A differs from the other two fields in its number of distinct values and/or has a higher correlation than they do.

Edit: Basically, creating an index means a FULL SCAN of the table, creating entries in the index as you progress. With the stats you have shared below, that means:

  • Column A: it was detected as unique.
    A single scan is enough, as the DB knows 1 record = 1 index entry.
  • Columns B & C: they were detected as having very few distinct values + abs(correlation) is very low.
    Each index entry takes an entire FULL SCAN of the table.

Note: the description is simplified to highlight the difference.


Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a small correlation means the indexes will probably not be used (an index is useful only when the matching entries are not scattered across all the table blocks).
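
If you do end up creating one of these indexes anyway, you can check whether the planner would actually use it with a plain EXPLAIN on a typical lookup (a sketch; the hex literal is just a placeholder value):

-- see whether the plan uses the index or falls back to a sequential scan
EXPLAIN
SELECT * FROM transactions
WHERE B = '\x0102030405060708090a0b0c0d0e0f1011121314';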


Solution 2:
Order records on the disk.
The initialization would be something like this:

CREATE TABLE Transactions_order AS SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B, C, A;
DROP TABLE Transactions_order;

The tricky part comes next: as records are inserted, updated, and deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
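
One way to keep an eye on it is to re-run ANALYZE and the pg_stats query from above after the reload, and again from time to time (a sketch, assuming the table is named transactions):

-- correlation close to 1 or -1 means the rows are still well ordered on disk
ANALYZE transactions;
SELECT attname, correlation FROM pg_stats WHERE tablename = 'transactions';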


Solution 3:
Create partitions and enjoy partition pruning.
Quite a lot of effort has gone into partitioning in recent PostgreSQL releases. It could be worth having a look into it.
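
As a minimal sketch, hash partitioning on B (declarative partitioning, available since PostgreSQL 11; the table layout and partition count below are assumptions, not taken from the question) could look like this:

-- parent table partitioned by hash of B (column types follow the question)
CREATE TABLE transactions_p (
    A bytea,
    B bytea,
    C bytea
) PARTITION BY HASH (B);

CREATE TABLE transactions_p_0 PARTITION OF transactions_p
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE transactions_p_1 PARTITION OF transactions_p
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... repeat for remainders 2 and 3

An equality lookup on B then only needs to touch one partition, and each per-partition index should be smaller and faster to build.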


9 Comments

Statistics are up to date; column A has n_distinct -1, corr. -0.004; column B 39608 and 0.02; column C 38426 and -0.011. Column A is basically a unique column; it isn't specified as such, however.
I don't know if the n_distinct values are absolute numbers, but if they are, the analyze command grossly underestimates the values for B and C. There are a few million different values for B and C. My use case is the following: A is a unique identifier for the transaction, while B and C are sender and recipient. I need to quickly access both B and C without knowledge of A (e.g. to quickly find all transactions done by C). Can you give me a hint how to solve this? Is Solution 3 a viable approach for that many unique values for B and C?
If I understood correctly, a "person" x could appear as both sender and recipient on different transactions. If this is the case, you probably need to change your DB design: put users in their own table (primary key created with a serial) and replace B and C with foreign keys in your transactions table. Integer (= the underlying type for serial) will be much easier for the DB to handle, as it is way smaller than 20 bytes.
Also, in old versions of PostgreSQL, hash indexes were not recommended, as they were not faster, took much more space, and needed way more time to be created than btree indexes (the default). The warning was removed in recent versions of the documentation, but does that really mean they corrected 100% of the issues? What if it's only 90%? Would that be visible in your queries? Give btree indexes a try, maybe.
Thirdly, check whether all the fields in your table are truly necessary. For instance, if you have a bunch of fields like "code1", "code2", ..., "code10", ..., try to move them to a table [transactions_A (foreign key), code ID (smallint), code value (string?)]. If you do not need to retrieve all the fields every time and you can make more records fit in each table block, this will save some time on your queries. This may appear counterintuitive, since it will sometimes add a JOIN to your queries, but I would really consider doing this for fields that are not always retrieved.
