2

Seems like a simple question but I'm having trouble accomplishing it. What I want to do is return all names that have duplicate ids. The view looks as such:

id |  name  | other_col
---+--------+----------
 1 | James  |    x
 2 | John   |    x
 2 | David  |    x
 3 | Emily  |    x
 4 | Cameron|    x
 4 | Thomas |    x

And so in this case, I'd just want the result:

name
-------
John
David
Cameron
Thomas

The following query works but it seems like an overkill to have two separate selects:

select name 
from view where id = ANY(select id from view 
                         WHERE other_col='x' 
                         group by id 
                         having count(id) > 1) 
      and other_col='x';

I believe it should be possible to do something under the lines of:

select name from view WHERE other_col='x' group by id, name having count(id) > 1;

But this returns nothing at all! What is the 'proper' query?

Do I just have to it like my first working suggestion or is there a better way?

3
  • 2
    a CTE may work nicely for you. Commented Jul 15, 2016 at 20:37
  • Are you simply trying to avoid using more than one SELECT statement? Commented Jul 15, 2016 at 20:47
  • @Nicarus Yes. I want to remove the redundant WHERE other_col='x' (the real conditions are quite long) so removing 1 SELECT will make it clearer. Commented Jul 15, 2016 at 20:51

5 Answers 5

2

You state you want to avoid two "queries", which isn't really possible. There are plenty of solutions available, but I would use a CTE like so:

WITH cte AS
(
SELECT
    id,
    name,
    other_col,
    COUNT(name) OVER(PARTITION BY id) AS id_count
FROM
    table
)

SELECT name FROM cte WHERE id_count > 1;

You can reuse the CTE, so you don't have to duplicate logic and I personally find it easier to read and understand what it is doing.

Sign up to request clarification or add additional context in comments.

3 Comments

Absolutely perfect! Thanks Nicarus, this is exactly what I was looking for.
Good one. BTW, I have studied Temp Tables and derived tables are good in performance when comparing with CTE, am i right?
@MuthaiahPL - In my experience it depends on what you are doing, but temp tables do allow for such things as adding indexes - CTEs do not. In this case the question was not about performance, but simplicity of writing the query, which I believe the CTE provides.
2
SELECT name FROM Table
WHERE id IN (SELECT id, COUNT(*) FROM Table GROUP BY id HAVING COUNT(*)>1) Temp

3 Comments

This is pretty much identical to what I suggested and I'm wondering if there's a better way to accomplish this...
May i know in what terms you are looking for a better way?
Specifically, I just want to be able to know if this can be done in a single query because I'd rather not have to repeat two separate selects (if possible). The actual where conditions are quite long and I'd like to simplify it.
0

Use EXIST operator

SELECT * FROM table t1
WHERE EXISTS(
  SELECT null FROM table t2
  WHERE t1.id = t2.id 
    AND t1.name <> t2.name
)

Comments

0

Use a join:

select distinct name 
from view v1
join view v2 on v1.id = v2.id
  and v1.name != v2.name

The use of distinct is there in case there are more than 2 rows sharing the same id. If that's not possible, you can omit distinct.


A note: Naming a column id when it's not unique will likely cause confusion, because it's the industry standard for the unique identifier column. If there isn't a unique column at all, it will cause coding difficulties.

1 Comment

This isn't a bad suggestion but it happens that my view is large so I'd have to go on v1.id = v2.id AND v1.other_col='x' AND v2.other_col='x' and it takes longer than the accepted answer. About the id, I will make sure to name my columns more appropriately in future questions, thanks!
0

Do not use a CTE. That's typically more expensive because Postgres has to materialize the intermediary result.

An EXISTS semi-join is typically fastest for this. Just make sure to repeat predicates (or match the values):

SELECT name 
FROM   view v
WHERE  other_col = 'x'
AND    EXISTS (
   SELECT 1 FROM view 
   WHERE  other_col = 'x' -- or: other_col = v.other_col
   AND    id <> v.id      -- exclude join to self
   );

That's a single query, even if you see the keyword SELECT twice here. An EXISTS expression does not produce a derived table, it will be resolved to simple index look-ups.

Speaking of which: a multicolumn index on (other_col, id) should help. Depending on data distribution and access patterns, appending the payload column name to enable index-only scans might help: (other_col, id, name). Or even a partial index, if other_col = 'x' is a constant predicate:

CREATE INDEX ON view (id) WHERE other_col = 'x';

The upcoming Postgres 9.6 would even allow an index-only scan on the partial index:

CREATE INDEX ON view (id, name) WHERE other_col = 'x';

You will love this improvement (quoting the /devel manual):

Allow using an index-only scan with a partial index when the index's predicate involves column(s) not stored in the index (Tomas Vondra, Kyotaro Horiguchi)

An index-only scan is now allowed if the query mentions such columns only in WHERE clauses that match the index predicate

Verify performance with EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Run a couple of times to rule out caching effects.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.