I have a setup for forum posts and want to retrieve posts created by a specific user using the following query:

SELECT * FROM forum.posts WHERE authorid=? ORDER BY postid LIMIT ?

Where authorid is indexed and postid is the clustered primary key. Here is the full schema:

+--------------+--------------------------+-------------+
| Column       | Type                     | Modifiers   |
|--------------+--------------------------+-------------|
| postid       | integer                  |  not null   |
| postdate     | timestamp with time zone |  not null   |
| postbody     | text                     |  not null   |
| parentthread | integer                  |  not null   |
| parentpage   | integer                  |  not null   |
| authorid     | integer                  |  not null   |
| totalpages   | integer                  |             |
| postsubject  | text                     |             |
| thread       | boolean                  |  not null   |
| subforum     | smallint                 |  not null   |
+--------------+--------------------------+-------------+
Indexes:
    "posts_pkey" PRIMARY KEY, btree (postid) CLUSTER
    "date_index" btree (postdate)
    "forum_index" btree (subforum)
    "page_index" btree (parentpage)
    "parent_index" btree (parentthread)
    "thread_index" btree (thread)
    "user_index" btree (authorid)

However, for users with a lot of posts the query takes an extremely long time, because it first uses the index on authorid to retrieve the matching rows but then has to sort all of them by postid. Here is EXPLAIN ANALYZE for one such user:

Limit  (cost=22881.46..22881.53 rows=25 width=139) (actual time=1424.436..1424.451 rows=25 loops=1)
  ->  Sort  (cost=22881.46..22897.09 rows=6250 width=139) (actual time=1424.434..1424.442 rows=25 loops=1)
        Sort Key: postid
        Sort Method: top-N heapsort  Memory: 43kB
        ->  Index Scan using user_index on posts  (cost=0.57..22705.09 rows=6250 width=139) (actual time=2.235..1420.733 rows=3022 loops=1)
              Index Cond: (authorid = ?)
Planning time: 0.114 ms
Execution time: 1424.489 ms

I thought that clustering would help, but there are just too many posts. For users with even more posts, the planner scans posts_pkey and filters on authorid instead of using user_index and sorting. Although the estimated cost is low, it still ends up taking forever because so many rows are read and discarded:

Limit  (cost=0.57..149978.39 rows=25 width=139) (actual time=205822.311..210766.374 rows=25 loops=1)
  ->  Index Scan using posts_pkey on posts  (cost=0.57..664137787.62 rows=110706 width=139) (actual time=205822.310..210766.359 rows=25 loops=1)
        Filter: (authorid = ?)
        Rows Removed by Filter: 76736945
Planning time: 0.111 ms
Execution time: 210766.403 ms

How do I go about retrieving a user's posts sorted by postid? Is there any practical way in SQL for the index on authorid to also be sorted by postid? This functionality is important for what I am doing, and at this point a SQL database doesn't seem to be the best option.

1 Answer

For this query:

SELECT *
FROM forum.posts
WHERE authorid = ?
ORDER BY postid
LIMIT ?

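I would recommend a composite secondary index on (authorid, postid). Within each authorid the index entries are already ordered by postid, so the query can read the first rows for that author straight from the index, which should prevent the sorting.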
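
As a rough sketch of what that could look like in PostgreSQL (the index name posts_author_postid_idx and the literal author id 12345 are placeholders, not from the question; CONCURRENTLY is optional and only there on the assumption that the table is live and should stay writable during the build):

```sql
-- Hypothetical index name. CONCURRENTLY avoids blocking writes while the
-- index is built, but it cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY posts_author_postid_idx
    ON forum.posts (authorid, postid);

-- Refresh planner statistics for the table.
ANALYZE forum.posts;

-- Re-check the plan: the hope is an Index Scan on the new index
-- directly under the Limit node, with no Sort node and no Filter.
EXPLAIN ANALYZE
SELECT *
FROM forum.posts
WHERE authorid = 12345   -- placeholder author id
ORDER BY postid
LIMIT 25;
```

If the new plan looks right, the existing user_index on (authorid) alone is probably redundant, since the composite index has authorid as its leading column; it may be worth dropping after checking that no other query depends on it.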
