
I've had this question for a while and I'm wondering if there is a faster query.

I have a table with multiple entries per ID, and I would like to list all columns that have different values for the same ID.

ID  Brand  Type
1   Honda  Coupe
1   Jeep   SUV
2   Ford   Sedan
2   Ford   Crossover

For the table above:
Rows with ID = 1 have different Brand and Type values, so I want one result row for each column.
For ID = 2 there is only one brand but multiple types, so only one result row, for Type.

The desired result would look like this:

ID  Difference
1   Brand
1   Type
2   Type

I solved it with the query below, checking each column in its own SELECT statement and then combining everything with UNION:

SELECT ID, 'Brand' AS Discrepancy
FROM table
GROUP BY ID
HAVING COUNT(DISTINCT Brand) > 1

UNION 

SELECT ID,'Type' AS Discrepancy
FROM table
GROUP BY ID
HAVING COUNT(DISTINCT Type) > 1;

Is there any faster query or optimization?

  • Your current query is already pretty optimal. Commented Feb 21, 2024 at 4:01
  • To be clear, the query is not optimal. Commented Mar 7, 2024 at 8:59

2 Answers


Your query is good for a few rows per ID (except that UNION should be UNION ALL).
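For reference, the minimal fix is just swapping in UNION ALL: the two branches emit different string literals, so they can never produce duplicates, and plain UNION only adds a needless de-duplication step:

SELECT ID, 'Brand' AS Discrepancy
FROM table
GROUP BY ID
HAVING COUNT(DISTINCT Brand) > 1

UNION ALL  -- instead of UNION: branches can never overlap

SELECT ID, 'Type' AS Discrepancy
FROM table
GROUP BY ID
HAVING COUNT(DISTINCT Type) > 1;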
This one is better (improved with a hint from Charlieface):

SELECT t.id, c.difference
FROM  (
   SELECT id
        , min(brand) <> max(brand) AS b_diff  -- true if more than one distinct brand
        , min(type)  <> max(type)  AS t_diff  -- true if more than one distinct type
   FROM   tbl
   GROUP  BY id
   ) t
JOIN   LATERAL (
   VALUES
     ('Brand', t.b_diff)
   , ('Type' , t.t_diff)
   ) c(difference, diff) ON c.diff  -- keep only columns that actually differ
ORDER  BY 1, 2;  -- optional


A single sequential scan should bring the cost down by almost half, and avoiding the expensive count(DISTINCT ...) should help some more. Test with EXPLAIN ANALYZE.
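For example (a sketch; wrap each candidate query the same way and compare total runtime and buffer usage):

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   tbl
GROUP  BY id
HAVING count(DISTINCT brand) > 1;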

Note that null values are ignored by either query.
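If you do want NULL to count as a value of its own, one way to extend the min/max check is to also flag groups that mix NULL and non-NULL entries. A sketch for the brand column (untested, same tbl as above):

SELECT id
     , (min(brand) <> max(brand)             -- more than one distinct non-null brand
        OR (count(brand) > 0
        AND count(brand) < count(*))) AS b_diff  -- group mixes NULL and non-NULL
FROM   tbl
GROUP  BY id;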

If there are many rows per ID (and an index on each of the tested columns), there are (much) faster options.
If so, and if it matters, start a new question providing the info requested in the tag description: Postgres version, exact table definition, a test case and, most importantly, rough stats about the data distribution. And post a comment here to link to the follow-up.
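To give an idea of the shape such an option might take (just a sketch, assuming a btree index on (id, brand); not a recommendation without knowing your data): correlated min/max subqueries can each be answered with a single index probe, so the many rows per ID are never actually read:

SELECT i.id, 'Brand' AS difference
FROM  (SELECT DISTINCT id FROM tbl) i  -- ideally a lookup table of unique IDs instead
WHERE (SELECT min(brand) FROM tbl WHERE id = i.id)
   <> (SELECT max(brand) FROM tbl WHERE id = i.id);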


8 Comments

MIN(x) <> MAX(x) is probably much faster than doing multiple COUNT(DISTINCT ...).
Yes, min + max is certainly cheaper. The major factor is one sequential (or index) scan instead of multiple, though. I added a variant as per your suggestion. If the columns of interest have many entries per ID, and an index, there are (much) faster ways yet ... (reposted to clarify few vs. many)
I don't have time to explore this, as PostgreSQL isn't in my wheelhouse, but I was in the process of suggesting a window function OVER a partition by ID, in combination with JOIN LATERAL, to derive a change count for each field independently. I'd be curious to see what you think of the idea.
@gview: Since we GROUP BY id, I don't see what a window function could buy at this point ...
It would be an alternative to GROUP BY id: with the ability to compare each row within the window to the previous one, you could emit a value whenever there is a difference. Likely a similar final result, with similar performance, I'm guessing.

You can use to_json() to avoid hardcoding column names: it picks up the column names and uses them as keys in the JSON. Then you can unpivot that with json_each_text(), spitting the columns back out with their corresponding values, and check each one for more than one distinct value per ID. Demo:

select id, column_name as "Difference"
from test, json_each_text(to_json(test)) as a(column_name, v)  -- unpivot each row into (column_name, value) pairs
group by id, column_name
having min(v) <> max(v);

It also does the job in a single pass, and it keeps @Charlieface's min <> max trick as the alternative to counting. Keep in mind that regardless of the number of columns, it only wins the "brevity" and "it's dynamic" participation awards: it is easily outperformed by the variant that doesn't multiply the row set by unpivoting.
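As a side note, a jsonb variant can drop the grouping key before unpivoting (a sketch, assuming Postgres 9.5+ for to_jsonb() and the - operator; id can never differ within its own group anyway, so this only trims the unpivoted set a little):

select id, column_name as "Difference"
from test, jsonb_each_text(to_jsonb(test) - 'id') as a(column_name, v)
group by id, column_name
having min(v) <> max(v);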

The choice boils down to speed at the cost of maintenance effort (column alterations need to be cascaded wherever columns are hardcoded), or set-it-and-forget-it convenience if you can afford the price in performance.

