
I'm having a problem with a PostgreSQL table. The table has 14 columns, two of which are indexed as follows:

       Column        | Type | Modifiers 
---------------------+------+-----------
 u_id                | text | not null
 p_date              | date | not null
...

Indexes:
    "unq_uid_pdate" UNIQUE CONSTRAINT, btree (u_id, p_date)
    "pdate_idx" btree (p_date)

The table has around 5 billion rows, with around 10 million distinct values in u_id, and each u_id has around 500 distinct values of p_date on average.
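In case it's useful, a cheap way to sanity-check these figures without counting 5 billion rows is to read the planner statistics (pg_stats only gives estimates, so treat the numbers as approximate):

-- Sketch: read the planner's estimates instead of scanning the table.
-- n_distinct is reported as a negative fraction of the row count when
-- the planner thinks it scales with table size.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'l_nights' AND attname IN ('u_id', 'p_date');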

When I export by u_id, or without any filter, performance is good, reaching speeds of 400-500K rows per second without significant drops over time:

psql "..." -c "\\copy (SELECT u_id, p_date FROM l_nights) to stdout with csv header" | pv --line-mode --rate > /dev/null

[ 468k/s]
[ 485k/s]
[ 497k/s]
[ 435k/s]

However, if I want to export the approximately 10 million rows of a specific p_date, performance is very poor: it starts at a few thousand rows per second and stalls within a few minutes. With a LIMIT, the behaviour is the same:

psql "..." -c "\\copy (SELECT u_id, p_date FROM l_nights where p_date = '2020-06-01' limit 50000) to stdout with csv header" | pv --line-mode --rate > /dev/null

[1.09k/s]
[1.28k/s]
[1.48k/s]
[3.01k/s]
[ 954 /s]
[1.40k/s]
[ 918 /s]
[ 849 /s]
[ 383 /s]
[ 364 /s]
[ 358 /s]
[0.00 /s]
[ 170 /s]
[0.00 /s]

The query is using the proper index, according to the EXPLAIN ANALYZE output:

explain analyse select u_id, p_date from l_nights where p_date = '2020-06-01' limit 50000;

QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.72..158646.09 rows=50000 width=15) (actual time=9.672..207430.728 rows=50000 loops=1)
   ->  Index Scan using pdate_idx on l_nights  (cost=0.72..233913736.58 rows=73722207 width=15) (actual time=9.671..207396.204 rows=50000 loops=1)
         Index Cond: (p_date = '2020-06-01'::date)
 Planning Time: 0.182 ms
 Execution Time: 207451.360 ms

Based on this performance, it seems to me as though the search is doing a full scan, even though the column is indexed. Other export strategies, such as reading through a PostgreSQL cursor in Node.js, behave the same way.
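For what it's worth, 50,000 rows in roughly 207 seconds works out to about 240 rows per second, which looks more like one random heap page read per returned row than a sequential scan. A buffers-enabled plan should confirm that (a sketch; I have not captured its output here):

-- Sketch: BUFFERS reports how many shared pages each node hit or read,
-- which should show whether every returned row triggers a heap fetch.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u_id, p_date
FROM l_nights
WHERE p_date = '2020-06-01'
LIMIT 50000;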

What am I doing wrong?

  • Maybe an index on (p_date, u_id) would be better; you can try that (see the sketch after these comments). But I don't know whether retrieval is really the problem here; the bottleneck might be the writing to the export file. Commented Sep 2, 2020 at 10:56
  • The LIMITed query is exporting 50,000 rows per second. You need to capture the plan for the query which exhibits the problem, not a related query which doesn't. auto_explain can help with that. Commented Sep 2, 2020 at 13:31
  • With LIMIT, the behaviour is the same, but 1,000 rows is not enough to perceive it. I tried with a LIMIT of 50K rows and got very poor performance as well. I updated the question to make this clear. Thanks. Commented Sep 2, 2020 at 16:04
  • I agree with sticky that an index on (p_date, u_id) will be better (I'll explain why I think so, if you want), but could you provide the data type of p_date? Commented Sep 5, 2020 at 23:57
  • @hsibboni I added the data types of both columns to the description. Why do you think a multicolumn index would be better? If it's because it can obtain the fields directly from the index, I also tried selecting only the p_date field, and the speed is the same. Thanks. Commented Sep 7, 2020 at 9:30
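For reference, the composite index suggested in the comments would look something like this (a sketch; the index name is mine, and CONCURRENTLY is optional but avoids blocking writes on a table this large):

-- Hypothetical composite index from the comments; the name is made up.
CREATE INDEX CONCURRENTLY pdate_uid_idx ON l_nights (p_date, u_id);

-- An index-only scan on it also needs a reasonably fresh visibility map:
VACUUM l_nights;

With both p_date and u_id in the index, the export query could in principle be answered by an index-only scan, skipping the per-row heap visits entirely.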
