I have a setup for forum posts and want to retrieve posts created by a specific user using the following query:
SELECT * FROM forum.posts WHERE authorid=? ORDER BY postid LIMIT ?
Here authorid is indexed and postid is the clustered primary key. This is the full schema:
+--------------+--------------------------+-------------+
| Column | Type | Modifiers |
|--------------+--------------------------+-------------|
| postid | integer | not null |
| postdate | timestamp with time zone | not null |
| postbody | text | not null |
| parentthread | integer | not null |
| parentpage | integer | not null |
| authorid | integer | not null |
| totalpages | integer | |
| postsubject | text | |
| thread | boolean | not null |
| subforum | smallint | not null |
+--------------+--------------------------+-------------+
Indexes:
"posts_pkey" PRIMARY KEY, btree (postid) CLUSTER
"date_index" btree (postdate)
"forum_index" btree (subforum)
"page_index" btree (parentpage)
"parent_index" btree (parentthread)
"thread_index" btree (thread)
"user_index" btree (authorid)
However, for users with a lot of posts the query takes an extremely long time, because it first uses the index on authorid to retrieve the matching rows and then has to sort all of them again by postid. Here is EXPLAIN ANALYZE for one such user:
Limit (cost=22881.46..22881.53 rows=25 width=139) (actual time=1424.436..1424.451 rows=25 loops=1)
-> Sort (cost=22881.46..22897.09 rows=6250 width=139) (actual time=1424.434..1424.442 rows=25 loops=1)
Sort Key: postid
Sort Method: top-N heapsort Memory: 43kB
-> Index Scan using user_index on posts (cost=0.57..22705.09 rows=6250 width=139) (actual time=2.235..1420.733 rows=3022 loops=1)
Index Cond: (authorid = ?)
Planning time: 0.114 ms
Execution time: 1424.489 ms
I thought that clustering would help, but there are just too many posts, and for users with even more posts the planner scans the primary key index with a filter on authorid instead of sorting the rows found through user_index. Although the cost is low, it still ends up taking forever because so many rows have to be read and discarded:
Limit (cost=0.57..149978.39 rows=25 width=139) (actual time=205822.311..210766.374 rows=25 loops=1)
-> Index Scan using posts_pkey on posts (cost=0.57..664137787.62 rows=110706 width=139) (actual time=205822.310..210766.359 rows=25 loops=1)
Filter: (authorid = ?)
Rows Removed by Filter: 76736945
Planning time: 0.111 ms
Execution time: 210766.403 ms
How do I go about retrieving a user's posts in sorted order? Is there any practical way in SQL for the index on authorid to be sorted by postid? This functionality is important for what I am doing, and at this point a SQL database doesn't seem to be the best option.
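To make the request concrete: the kind of index described above would be a two-column btree on (authorid, postid), in which the entries for each author are kept in postid order. The following is only a minimal sketch, assuming PostgreSQL; the index name user_post_index is hypothetical and nothing here has been tested against this table.

```sql
-- Hypothetical composite index (the name is an assumption, not part of the schema above).
-- Within each authorid, entries are stored in postid order.
CREATE INDEX user_post_index ON forum.posts (authorid, postid);

-- The original query, unchanged. With such an index the planner may be able to
-- read one author's rows already ordered by postid and stop after LIMIT rows,
-- without a separate sort step or a filtered scan of posts_pkey.
SELECT * FROM forum.posts WHERE authorid = ? ORDER BY postid LIMIT ?;
```

Whether the planner actually picks this index would need to be checked with EXPLAIN ANALYZE on the real data.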