PostgreSQL IN clause versus nested SELECT with JOIN performance

Question

I have a query right now that works well but will have scaling problems. The solution I have found is wildly slow. I'm looking to speed up the second query.

Old query that won't scale well:

SELECT users.score
FROM users
WHERE
  users.id IN (
    SELECT user_id
    FROM companies_users
    WHERE companies_users.company_id = X
)

Then I would iterate across the different scores to group them. Scores range from -10 to 10. The problem comes from the IN SELECT statement and the iteration. There could be over a million user_ids returned.

The alternative I've come up with should scale better but is wildly slow:

SELECT
  COUNT(*) as total_scores,
  (SELECT COUNT(*) FROM users
    JOIN companies_users as cu ON cu.user_id = users.id
    WHERE users.score = 10 AND cu.company_id = X) as "10",
  (SELECT COUNT(*) FROM users
    JOIN companies_users as cu ON cu.user_id = users.id
    WHERE users.score = 9 AND cu.company_id = X) as "9",
...
  (SELECT COUNT(*) FROM users
    JOIN companies_users as cu ON cu.user_id = users.id
    WHERE users.score = -9 AND cu.company_id = X) as "-9",
  (SELECT COUNT(*) FROM users
    JOIN companies_users as cu ON cu.user_id = users.id
    WHERE users.score = -10 AND cu.company_id = X) as "-10"
FROM users
  JOIN companies_users as cu ON cu.user_id = users.id
  WHERE cu.company_id = X

The first query requires iteration to turn its results into usable data. The second is good to go.

Is there a way to pull the JOIN out of the nested SELECTs? That seems to be causing the majority of the slowdown in the second query. Also, am I right that the first query won't scale well when dealing with millions of ids?

jcaron · Accepted Answer · 2017-02-10 18:12:30Z


What would be the problem with:

SELECT u.score
FROM companies_users cu
    JOIN users u ON cu.user_id = u.id
WHERE cu.company_id=?
GROUP BY u.score
ORDER BY u.score

?

Also, do you have appropriate indices? You need an index on companies_users(company_id) and one on users(id). You may also try adding one on companies_users(user_id), in case the planner decides it's better to run the query the other way around. EXPLAIN and EXPLAIN ANALYZE are your friends.
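
As a rough sketch of the indexing advice above, assuming the tables look like users(id, score) and companies_users(company_id, user_id). The index names and the company id 42 are placeholders, not taken from the question:

```sql
-- Assumed schema: users(id, score), companies_users(company_id, user_id).
-- Index names below are hypothetical.

-- Lets the planner find all rows for one company without a full scan.
CREATE INDEX IF NOT EXISTS companies_users_company_id_idx
    ON companies_users (company_id);

-- Optional: may help if the planner drives the join from users instead.
CREATE INDEX IF NOT EXISTS companies_users_user_id_idx
    ON companies_users (user_id);

-- users(id) is already indexed if it is declared as the primary key.

-- Print the plan the planner chose, with real row counts and timings.
-- 42 stands in for the company id.
EXPLAIN ANALYZE
SELECT u.score
FROM companies_users cu
    JOIN users u ON cu.user_id = u.id
WHERE cu.company_id = 42
GROUP BY u.score
ORDER BY u.score;
```

Note that EXPLAIN ANALYZE actually executes the query, while plain EXPLAIN only shows the estimated plan. CREATE INDEX IF NOT EXISTS needs PostgreSQL 9.5 or later; on older versions, drop the IF NOT EXISTS.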


1 Comment

amiksch Over a year ago

Thanks for the reply! That is pretty close to perfect. I'm actually looking for the counts on the different scores. I used your solution but changed the select portion to u.score, count(u.score) and have got all the data! Thanks again.
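
For reference, the change described in this comment would presumably look like the first statement below. The second statement is a possible variant, assuming PostgreSQL 9.4 or later for the FILTER clause, that returns the one-row layout from the question while doing the JOIN only once. The alias total and the company id 42 are placeholders.

```sql
-- One row per score with its count; 42 stands in for the company id.
SELECT u.score, COUNT(u.score) AS total
FROM companies_users cu
    JOIN users u ON cu.user_id = u.id
WHERE cu.company_id = 42
GROUP BY u.score
ORDER BY u.score;

-- Variant: a single row with one column per score, as in the question.
-- Only three score columns are shown; the others follow the same pattern.
SELECT
    COUNT(*) AS total_scores,
    COUNT(*) FILTER (WHERE u.score = 10)  AS "10",
    COUNT(*) FILTER (WHERE u.score = 9)   AS "9",
    COUNT(*) FILTER (WHERE u.score = -10) AS "-10"
FROM companies_users cu
    JOIN users u ON cu.user_id = u.id
WHERE cu.company_id = 42;
```

The GROUP BY form omits scores that no user has, so the application has to treat a missing score as zero. The FILTER form always returns every listed column, with 0 where nothing matches.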
