
I have a table with three columns A, B, C, all of type bytea. There are around 180,000,000 rows in the table. A, B, and C each contain exactly 20 bytes of data; C sometimes contains NULLs.

When creating indexes for all columns with

CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);

index_A is created in around 10 minutes, while the indexes on B and C were still running after more than 10 hours, at which point I aborted them. I ran each CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database. When running

SELECT * FROM pg_stat_activity;

wait_event_type and wait_event are both NULL, and state is active.

Why are the second and third index creations taking so long, and can I do anything to speed them up?

1 Answer


Ensure the statistics on your table are up-to-date.
Then execute the following query:

SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = '<Your table name here>';
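
If the statistics might be stale, refreshing them before running that query is a one-liner (a sketch, assuming the table is named transactions as in the question):

-- refresh planner statistics so pg_stats reflects the current data
ANALYZE transactions;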

Basically, the database has more work to do when creating an index if:

  • The number of distinct values is higher.
  • The correlation (i.e. how closely the physical order of rows on disk matches the order of the values in the column) is close to 0.

I suspect you will see that field A differs from the other two fields in its number of distinct values and/or has a higher correlation than they do.

Edit: Basically, creating an index means a FULL SCAN of the table, creating entries in the index as you progress. With the stats you have shared below, that means:

  • Column A: it was detected as unique.
    A single scan is enough, as the DB knows 1 record = 1 index entry.
  • Columns B & C: they were detected as having very few distinct values + abs(correlation) is very low.
    Each index entry takes an entire FULL SCAN of the table.

Note: the description is simplified to highlight the difference.


Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a small correlation means the indexes will probably not be used (an index is useful only when the matching entries are not scattered across all the table blocks).
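
If you do end up creating one of these indexes anyway, you can check whether the planner would actually use it with a plain EXPLAIN on a typical lookup (a sketch; the hex literal is just a placeholder value):

-- see whether the plan uses the index or falls back to a sequential scan
EXPLAIN
SELECT * FROM transactions
WHERE B = '\x0102030405060708090a0b0c0d0e0f1011121314';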


Solution 2:
Order records on the disk.
The initialization would be something like this:

CREATE TABLE Transactions_order AS SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B, C, A;
DROP TABLE Transactions_order;

The tricky part comes next: as records are inserted, updated, and deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
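
One way to keep an eye on it is to re-run ANALYZE and the pg_stats query from above after the reload, and again from time to time (a sketch, assuming the table is named transactions):

-- correlation close to 1 or -1 means the rows are still well ordered on disk
ANALYZE transactions;
SELECT attname, correlation FROM pg_stats WHERE tablename = 'transactions';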


Solution 3:
Create partitions and enjoy partition pruning.
Quite a lot of effort has gone into partitioning in recent PostgreSQL releases. It could be worth having a look into it.
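
As a minimal sketch, hash partitioning on B (declarative partitioning, available since PostgreSQL 11; the table layout and partition count below are assumptions, not taken from the question) could look like this:

-- parent table partitioned by hash of B (column types follow the question)
CREATE TABLE transactions_p (
    A bytea,
    B bytea,
    C bytea
) PARTITION BY HASH (B);

CREATE TABLE transactions_p_0 PARTITION OF transactions_p
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE transactions_p_1 PARTITION OF transactions_p
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... repeat for remainders 2 and 3

An equality lookup on B then only needs to touch one partition, and each per-partition index should be smaller and faster to build.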


9 Comments

Statistics are up to date; column A has n_distinct -1, corr. -0.004; column B 39608 and 0.02; column C 38426 and -0.011. Column A is basically a unique column; it isn't specified as such, however.
I don't know if the n_distinct values are absolute numbers, but if they are, the analyze command grossly underestimates the values for B and C. There are a few million different values for B and C. My use case is the following: A is a unique identifier for the transaction, while B and C are sender and recipient. I need to quickly access both B and C without knowledge of A (e.g. to quickly find all transactions done by C). Can you give me a hint how to solve this? Is Solution 3 a viable approach for that many unique values for B and C?
If I understood correctly, a "person" x could appear as both sender and recipient on different transactions. If this is the case, you probably need to change your DB design: put users in their own table (primary key created with a serial) and replace B and C with foreign keys in your transactions table. Integer (= the underlying type for serial) will be much easier for the DB to handle, as it is way smaller than 20 bytes.
Also, in old versions of PostgreSQL, hash indexes were not recommended, as they were not faster, took much more space, and needed way more time to be created than btree indexes (the default). The warning was removed in recent versions of the documentation, but does that really mean they corrected 100% of the issues? What if it's only 90%? Would that be visible in your queries? Give btree indexes a try, maybe.
Thirdly, check whether all the fields in your table are truly necessary. For instance, if you have a bunch of fields like "code1", "code2", ..., "code10", ..., try to move them to a table [transactions_A (foreign key), code ID (smallint), code value (string?)]. If you do not need to retrieve all the fields every time and you can make more records fit in each table block, this will save some time on your queries. This may appear counterintuitive, since it will sometimes add a JOIN to your queries, but I would really consider doing this for fields that are not always retrieved.
