
I'm trying to determine the best indexes for a table in PostgreSQL. I expect on the order of 10 billion rows and 10 TB of data.

The table has 5 main columns used for filtering and/or sorting:

  • Filtering: 3 columns of binary data stored as bytea
  • Filtering / sorting: 2 columns of type integer
CREATE TABLE tab (
  filter_key_1 bytea,    -- filtering
  filter_key_2 bytea,    -- filtering
  filter_key_3 bytea,    -- filtering
  sort_key_1   integer,  -- filtering & sorting
  sort_key_2   integer   -- filtering & sorting
);

Queries will be:

SELECT * FROM tab WHERE filter_key_1 = $1 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_2 = $1 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_3 = $1 ORDER BY sort_key_1, sort_key_2 LIMIT 15;

SELECT * FROM tab WHERE filter_key_1 = $1 AND sort_key_1 <= $2 AND sort_key_2 <= $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_2 = $1 AND sort_key_1 <= $2 AND sort_key_2 <= $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_3 = $1 AND sort_key_1 <= $2 AND sort_key_2 <= $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;

What are the ideal indexes for this table? How large will they get with ~10 billion rows? How much will they limit write throughput?

Edit

What if I want to add additional queries such as those below? Would the indexes from above hold up?

SELECT * FROM tab WHERE filter_key_1 = $1 AND filter_key_2 = $2 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
SELECT * FROM tab WHERE filter_key_1 = $1 AND filter_key_2 = $2 AND filter_key_3 = $3 ORDER BY sort_key_1, sort_key_2 LIMIT 15;
-- ...
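
My understanding is that serving these combined filters directly would take wider composite indexes, something like the purely illustrative sketch below (same table and column names as above), which is why I wonder whether the three single-key indexes still hold up:

-- Illustrative only: wider indexes that would match the combined filters exactly
CREATE INDEX ON tab (filter_key_1, filter_key_2, sort_key_1, sort_key_2);
CREATE INDEX ON tab (filter_key_1, filter_key_2, filter_key_3, sort_key_1, sort_key_2);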

I/O requirements

The workload is read-heavy with few writes.

Read speed is important; write speed is less important (I can live with up to 3 seconds per insert).

  • Read:
    • expecting on average 150 read queries/sec
    • most queries pulling in 100 to 100,000 rows after WHERE and before LIMIT
  • Write:
    • expecting 1 write query per 12 seconds (~0.08 queries/sec)
    • writing 500-1000 rows/query (~42-84 rows/sec)
2 Comments
  • What is perfect here depends on a) how selective the WHERE conditions are, b) what the read/write ratio of the table is, and c) how often your queries run and how important speed is. The indexes can become larger than the table. Commented Sep 30, 2022 at 6:01
  • @LaurenzAlbe low write, heavy read. Read: expecting on average 150 read queries/sec returning 15 rows/query -> 2,250 rows/second. Write: expecting 1 write query per 12 seconds writing 500-1000 rows -> 0.08 queries/second, 42-84 rows/second. Read speed is important. Write speed is less important (can live with up to 3 seconds per insert). Commented Sep 30, 2022 at 6:30

1 Answer


Since you need to run these queries all the time, you will have to optimize them as much as possible. That would mean:

CREATE INDEX ON tab (filter_key_1, sort_key_1, sort_key_2);
CREATE INDEX ON tab (filter_key_2, sort_key_1, sort_key_2);
CREATE INDEX ON tab (filter_key_3, sort_key_1, sort_key_2);
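
Each of the queries above can then be answered by walking a single index in ORDER BY order, so the LIMIT stops the scan after 15 matching entries. A quick way to confirm this (a sketch, using the table and column names from the question and a placeholder bytea value) is to check that the plan contains no separate Sort node:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tab
WHERE filter_key_1 = '\x0123'::bytea  -- placeholder value
ORDER BY sort_key_1, sort_key_2
LIMIT 15;
-- Expected shape: Limit -> Index Scan using the (filter_key_1, sort_key_1,
-- sort_key_2) index, with no Sort node in the plan.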

Together, these indexes should be substantially larger than your table: each of the ~10 billion rows appears in all three of them, and every index entry repeats one of the bytea keys plus both integers and per-tuple overhead.
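
Once a representative amount of data is loaded, the sizes can be measured rather than estimated (a sketch; tab is the table name used above):

-- Size of each index on the table:
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'tab';

-- Combined size of all indexes on the table:
SELECT pg_size_pretty(pg_indexes_size('tab'));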


2 Comments

Size might indeed become an issue, especially if we already have 10 TB of just plain data!! I wonder if creating 3 HASH indexes, one on each filter key, might already be enough; reading and sorting 100 to 100,000 rows will bring some overhead, but it still might be acceptable? Then again, doing 150 of those per second with the occasional write in the background too... I wonder. Another clear example of 'no such thing as a free lunch' =)
Yes, it is the 150 per second that will kill you. Otherwise less perfect indexes would do. Of course, there is always the option to change the requirements...
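
To put the hash-index idea from the comments in concrete terms (a sketch with the same hypothetical names): a hash index stores only a fixed-size hash code per entry, so it stays much smaller than the composite b-trees, but it supports only equality lookups, which means every query would have to fetch and sort all 100 to 100,000 matching rows before applying the LIMIT:

-- Smaller indexes, but no help with ORDER BY ... LIMIT:
CREATE INDEX ON tab USING hash (filter_key_1);
CREATE INDEX ON tab USING hash (filter_key_2);
CREATE INDEX ON tab USING hash (filter_key_3);
-- Each probe is a cheap equality lookup, but rows come back unordered, so
-- ORDER BY sort_key_1, sort_key_2 LIMIT 15 forces a sort of every matching
-- row, on each of the ~150 queries per second.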
