PostgreSQL: How to select on non-aggregating column?

Question

Seems like a simple question but I'm having trouble accomplishing it. What I want to do is return all names that have duplicate ids. The view looks as such:

id |  name  | other_col
---+--------+----------
 1 | James  |    x
 2 | John   |    x
 2 | David  |    x
 3 | Emily  |    x
 4 | Cameron|    x
 4 | Thomas |    x

And so in this case, I'd just want the result:

name
-------
John
David
Cameron
Thomas

The following query works but it seems like an overkill to have two separate selects:

select name 
from view where id = ANY(select id from view 
                         WHERE other_col='x' 
                         group by id 
                         having count(id) > 1) 
      and other_col='x';

I believe it should be possible to do something under the lines of:

select name from view WHERE other_col='x' group by id, name having count(id) > 1;

But this returns nothing at all! What is the 'proper' query?

Do I just have to it like my first working suggestion or is there a better way?

Are you simply trying to avoid using more than one SELECT statement? — Nick
– Nick, Commented Jul 15, 2016 at 20:47
@Nicarus Yes. I want to remove the redundant WHERE other_col='x' (the real conditions are quite long) so removing 1 SELECT will make it clearer. — nico
– nico, Commented Jul 15, 2016 at 20:51

Nick · Accepted Answer · 2016-07-15 20:55:38Z

2

You state you want to avoid two "queries", which isn't really possible. There are plenty of solutions available, but I would use a CTE like so:

WITH cte AS
(
SELECT
    id,
    name,
    other_col,
    COUNT(name) OVER(PARTITION BY id) AS id_count
FROM
    table
)

SELECT name FROM cte WHERE id_count > 1;

You can reuse the CTE, so you don't have to duplicate logic and I personally find it easier to read and understand what it is doing.

answered Jul 15, 2016 at 20:55

Nick

7,5012 gold badges25 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

nico Over a year ago

Absolutely perfect! Thanks Nicarus, this is exactly what I was looking for.

Muthaiah PL Over a year ago

Good one. BTW, I have studied Temp Tables and derived tables are good in performance when comparing with CTE, am i right?

Nick Over a year ago

@MuthaiahPL - In my experience it depends on what you are doing, but temp tables do allow for such things as adding indexes - CTEs do not. In this case the question was not about performance, but simplicity of writing the query, which I believe the CTE provides.

Nick · Accepted Answer · 2016-07-15 20:47:14Z

2

SELECT name FROM Table
WHERE id IN (SELECT id, COUNT(*) FROM Table GROUP BY id HAVING COUNT(*)>1) Temp

edited Jul 15, 2016 at 20:47

Nick

7,5012 gold badges25 silver badges46 bronze badges

answered Jul 15, 2016 at 20:40

Muthaiah PL

1,1983 gold badges16 silver badges27 bronze badges

3 Comments

nico Over a year ago

This is pretty much identical to what I suggested and I'm wondering if there's a better way to accomplish this...

Muthaiah PL Over a year ago

May i know in what terms you are looking for a better way?

nico Over a year ago

Specifically, I just want to be able to know if this can be done in a single query because I'd rather not have to repeat two separate selects (if possible). The actual where conditions are quite long and I'd like to simplify it.

krokodilko · Accepted Answer · 2016-07-15 20:43:35Z

0

Use EXIST operator

SELECT * FROM table t1
WHERE EXISTS(
  SELECT null FROM table t2
  WHERE t1.id = t2.id 
    AND t1.name <> t2.name
)

answered Jul 15, 2016 at 20:43

krokodilko

36.3k7 gold badges62 silver badges86 bronze badges

Comments

Bohemian · Accepted Answer · 2016-07-15 21:03:59Z

0

Use a join:

select distinct name 
from view v1
join view v2 on v1.id = v2.id
  and v1.name != v2.name

The use of distinct is there in case there are more than 2 rows sharing the same id. If that's not possible, you can omit distinct.

A note: Naming a column id when it's not unique will likely cause confusion, because it's the industry standard for the unique identifier column. If there isn't a unique column at all, it will cause coding difficulties.

edited Jul 15, 2016 at 21:03

answered Jul 15, 2016 at 20:51

Bohemian♦

427k103 gold badges603 silver badges750 bronze badges

1 Comment

nico Over a year ago

This isn't a bad suggestion but it happens that my view is large so I'd have to go on v1.id = v2.id AND v1.other_col='x' AND v2.other_col='x' and it takes longer than the accepted answer. About the id, I will make sure to name my columns more appropriately in future questions, thanks!

Community · Accepted Answer · 2017-05-23 12:14:44Z

Do not use a CTE. That's typically more expensive because Postgres has to materialize the intermediary result.

An EXISTS semi-join is typically fastest for this. Just make sure to repeat predicates (or match the values):

SELECT name 
FROM   view v
WHERE  other_col = 'x'
AND    EXISTS (
   SELECT 1 FROM view 
   WHERE  other_col = 'x' -- or: other_col = v.other_col
   AND    id <> v.id      -- exclude join to self
   );

That's a single query, even if you see the keyword SELECT twice here. An EXISTS expression does not produce a derived table, it will be resolved to simple index look-ups.

Speaking of which: a multicolumn index on (other_col, id) should help. Depending on data distribution and access patterns, appending the payload column name to enable index-only scans might help: (other_col, id, name). Or even a partial index, if other_col = 'x' is a constant predicate:

CREATE INDEX ON view (id) WHERE other_col = 'x';

PostgreSQL does not use a partial index

The upcoming Postgres 9.6 would even allow an index-only scan on the partial index:

CREATE INDEX ON view (id, name) WHERE other_col = 'x';

You will love this improvement (quoting the /devel manual):

Allow using an index-only scan with a partial index when the index's predicate involves column(s) not stored in the index (Tomas Vondra, Kyotaro Horiguchi)

An index-only scan is now allowed if the query mentions such columns only in WHERE clauses that match the index predicate

Verify performance with EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Run a couple of times to rule out caching effects.

Collectives™ on Stack Overflow

PostgreSQL: How to select on non-aggregating column?

5 Answers 5

3 Comments

3 Comments

Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

3 Comments

3 Comments

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related