
I have a SQL table in Postgres 14 that looks something like this:

f_key  data1         data2   fit
1      {'a1', 'a2'}  null    3
1      {'b1', 'b2'}  {'b3'}  2
2      {'c1', 'c2'}  null    3

Note that data1 and data2 are arrays.

I need to query this so that I will get data1 and data2 that have the best (highest) fit, but are not null (where possible), grouped by f_key.

So the result would look like this:

f_key  data1         data2
1      {'a1', 'a2'}  {'b3'}
2      {'c1', 'c2'}  null

My current approach is to use array_agg in this fashion:

select 
    tt.f_key,
    (array_agg(tt.data1) filter (where tt.data1 is not null))[1] as d1,
    (array_agg(tt.data2) filter (where tt.data2 is not null))[1] as d2
from
   (select * from items order by f_key, fit desc) as tt
group by f_key;

but d1 and d2 return null in this case. What leaves me completely puzzled is that:

  • (array_agg(tt.data1) filter (where tt.data1 is not null)) as d1 returns an array of arrays, as expected
  • (array_agg(tt.data1) filter (where tt.data1 is not null))[1][1] as d1 returns the first element of the first sub-array, as expected

How do I retrieve the first sub-array from the result of array_agg?

  • Use slice syntax. Also, you can't omit any subscripts of a multidimensional array; that's what gets you that null: "an array reference with the wrong number of subscripts yields a null rather than an error."
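
To see that in isolation, a minimal demo (array literal made up for the example):

select ('{{a1,a2},{b1,b2}}'::text[])[1];     -- null: one subscript short for a 2-D array
select ('{{a1,a2},{b1,b2}}'::text[])[1][1];  -- a1: correct number of subscripts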

3 Answers


get data1 and data2 that have the best (highest) fit, but are not null (where possible), grouped by f_key.

demo at db<>fiddle

select distinct f_key,
       first_value(data1) over (w1 order by data1 is null, fit desc) as d1,
       first_value(data2) over (w1 order by data2 is null, fit desc) as d2
from items
window w1 as (partition by f_key);
f_key  d1       d2
1      {a1,a2}  {b3}
2      {c1,c2}  null

This emulates an ignore nulls clause, or filter (where dataX is not null), by sorting the nulls behind everything else and then grabbing first_value(). Note that false sorts lower than true, so with the default ascending order the rows where dataX is null end up last.

Since that's a window function, not an aggregate, distinct has to deduplicate results.
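
To see the sorting trick in isolation (values made up for the demo):

select x
from (values ('a'), (null), ('b')) as t(x)
order by x is null, x;
-- a, b, null: only the null row makes "x is null" true, so it sorts last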


DISTINCT ON

with cte1 as (
  select distinct on (f_key) f_key, data1 as top1
  from items
  where data1 is not null
  order by f_key, fit desc),
cte2 as (
  select distinct on (f_key) f_key, data2 as top2
  from items
  where data2 is not null
  order by f_key, fit desc)
select distinct f_key, top1, top2
from items
left join cte1 using (f_key)
left join cte2 using (f_key);

It grabs the whole row coinciding with each top fit value per f_key. It's possibly more efficient. To really discuss performance, we'd at least have to know the volume and characteristics of your data sets and what indexes you have in place.


Slice syntax arr[1][:]

How do I retrieve the first sub-array from the result of array_agg?

select f_key,
       (array_agg(data1 order by fit desc) filter (where data1 is not null))[1][:] as d1,
       (array_agg(data2 order by fit desc) filter (where data2 is not null))[1][:] as d2
from items
group by f_key;
f_key  d1         d2
1      {{a1,a2}}  {{b3}}
2      {{c1,c2}}  null

In Postgres, omitting subscripts of a multidimensional array gets you a null:

an array reference with the wrong number of subscripts yields a null rather than an error.

Note that if there's even one slice in use, all other subscripts become slices, meaning that [1][:] is the same as [1:1][:]; but more importantly, [2][:] becomes [1:2][:], not [2:2][:] as you might expect.

If any dimension is written as a slice, i.e., contains a colon, then all dimensions are treated as slices. Any dimension that has only a single number (no colon) is treated as being from 1 to the number specified.
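
A quick demonstration with a made-up literal:

select ('{{1,2},{3,4},{5,6}}'::int[])[2][:];    -- {{1,2},{3,4}}, i.e. [1:2][:]
select ('{{1,2},{3,4},{5,6}}'::int[])[2:2][:];  -- {{3,4}}, the single row you probably meant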

Also, an order by in a subquery will typically give you an ordered array, but I'm not sure it's guaranteed to. Luckily, aggregate functions offer an internal order by.

It's worth underlining that a slice keeps the dimensionality of its source array. Since there's nothing else in there in your case, you can strip that one dimension with array(select unnest(arr)): it spits out all atomic elements, then re-collects them into a 1-D array. Here's a whole thread just about that one topic of unwrapping Postgres arrays.
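
For example:

select array(select unnest('{{a1,a2}}'::text[]));  -- {a1,a2}, back to one dimension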


Custom aggregates

What you're trying to do is effectively a vertical coalesce():

select f_key,
       coalesce_agg(data1 order by fit desc) d1,
       coalesce_agg(data2 order by fit desc) d2
from items
group by f_key;

You can create an aggregate that does exactly that:

create function coalesce_agg_sfunc(anycompatible, anycompatible)
returns anycompatible as 'select coalesce($1, $2)' language sql;

create aggregate coalesce_agg(anycompatible) (sfunc = coalesce_agg_sfunc,
                                              stype = anycompatible);
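
A quick sanity check (sample values made up for the demo): since the state function keeps the first non-null value it sees, the aggregate returns the first non-null input in the requested order.

select coalesce_agg(x order by ord) as first_non_null
from (values (1, null::int), (2, 5), (3, 7)) as t(ord, x);
-- returns 5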

If you were on PostgreSQL version 16 or higher, this test seems to suggest the built-in any_value() happens to act exactly the same when given an internal order by. Problem is, the maintainers are free to make it ignore that clause entirely in the future, which would make sense as an optimisation, same as for count(), sum(), avg() or any other commutative, order-insensitive aggregate.
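
For completeness, the 16+ equivalent would look like this; treat it as a sketch, since that ordered-input behaviour is an observation, not a documented contract:

select f_key,
       any_value(data1 order by fit desc) as d1,  -- relies on unspecified behaviour
       any_value(data2 order by fit desc) as d2
from items
group by f_key;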


The v14 you're on is currently the oldest supported version. Please plan an upgrade.


5 Comments

I found the DISTINCT ON approach interesting. For example, this option (dbfiddle.uk/qxEn_rgL)
Looks like we tested that at about the same time but your version overlooks f_keys with both data1 and data2 holding null values. There has to be another join to distinct/aggregated f_keys: dbfiddle.uk/p7ZfwaQo It's underspecified whether that's possible, and if not, your direct join is shorter and better.
Perhaps this option looks more reliable.
The slice syntax arr[1][:] almost works as I want it, but it still returns an array of length-one arrays, as shown in the results you provided. On the other hand, using array(select unnest(arr)) gives me what I need, but it seems like a lot of operations just to get the first element of an array.
I agree Postgres arrays are a bit of an acquired taste. If you edit the question to add some context regarding the volume, some characteristics and the current indexing of your data sets, it would be easier to pick the right method based on performance. Readability/clarity wise, I like the any_value()/coalesce_agg() added at the end.

With "array_agg(data_array1)" you get 2-dimensional array.

With test data

f_key  data1        data2   fit
1      {'a1','a2'}  null    3
1      {'b1','b2'}  {'b3'}  2
2      {'c1','c2'}  null    3
select *,
       array_dims(d1) as d_d1,
       array_dims(d2) as d_d2
from (
  select tt.f_key,
         array_agg(tt.data1) filter (where tt.data1 is not null) as d1,
         array_agg(tt.data2) filter (where tt.data2 is not null) as d2
  from (select * from items order by f_key, fit desc) as tt
  group by f_key
) t;
f_key  d1                         d2        d_d1        d_d2
1      {{'a1','a2'},{'b1','b2'}}  {{'b3'}}  [1:2][1:2]  [1:1][1:1]
2      {{'c1','c2'}}              null      [1:1][1:2]  null

If you want to get a 1-dimensional array, see the example below.
Here, the ordering by data1 or data2 matters less to us than the nulls last condition:
we try to take the first non-null value, if possible.

select *,
       array_dims(data1) as dim_data1,
       array_dims(data2) as dim_data2
from (
  select f_key,
         min(case when rn1 = 1 then data1 end) as data1,
         min(case when rn2 = 1 then data2 end) as data2
  from (
    select *,
           row_number() over (partition by f_key
               order by case when data1 is not null then 1 else 0 end desc,
                        fit desc, data1 nulls last) as rn1,
           row_number() over (partition by f_key
               order by case when data2 is not null then 1 else 0 end desc,
                        fit desc, data2 nulls last) as rn2
    from items as tt
  ) t
  group by f_key
) t2;
f_key  data1        data2   dim_data1  dim_data2
1      {'a1','a2'}  {'b3'}  [1:2]      [1:1]
2      {'c1','c2'}  null    [1:2]      null

fiddle

Comments


As you don't need to inspect the arrays and just want the best fit, you can skip all the window functions and array handling (although they were worth the detailed answers of both Zegarek and ValNik; don't skip reading them!) and simply rely on a good old limit 1 correlated subquery:
select f_key, (select dataN from items i2 where i2.f_key = i.f_key order by dataN is null, fit desc limit 1) dataN, … group by f_key;

And as noted by Zegarek, this can further be optimized by filtering out the row(s) with a fit but no data (which would have returned null) before even sorting them, as a subquery naturally returns null when it has no rows:

select
  f_key,
  (select data1 from items i2 where i2.f_key = i.f_key and data1 is not null order by fit desc limit 1) data1,
  (select data2 from items i2 where i2.f_key = i.f_key and data2 is not null order by fit desc limit 1) data2
from items i
group by f_key;
f_key  data1    data2
1      {a1,a2}  {b3}
2      {c1,c2}  null

(as shown in this db<>fiddle)

Note that this more or less corresponds to Zegarek's (PostgreSQL-specific) distinct on, in that you avoid intermediate array_agg()s and array popping, letting PostgreSQL just walk through scalar values as quickly as it can.

5 Comments

I think in this case adding and i2.dataX is not null to the where would make a bit more sense. The reason I'm using the order by trick to eliminate those is that first_value() is a non-aggregate window function, so it doesn't support a filter (where ...) clause, and Postgres doesn't offer an ignore nulls clause in general. If you have access to a regular where, you can be fairly certain it'll work better, on top of being simpler, clearer and easier to follow. dbfiddle.uk/iLekoCbi
Thanks, yes, this was an obvious simplification and optimization that I missed with my brain in the mists of covid.
Sad to hear that, I hope you're getting better.
It is an idea, and it produces the desired results. I'm only afraid that the (nearly identical) subqueries will have a negative effect on performance.
Depending on the number of dataN columns you have to handle, and on your data (if you have a lot of null holes you could even afford partial indexes on (f_key) where data1 is not null), you could be pleasantly surprised by letting PostgreSQL optimize those (apparently) multiple scalar walk-throughs with their limit 1, versus the server's in-place aggregation before fetching only one element. Do give it a try with an explain analyze; see the sketch below.
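
A rough sketch of that suggestion (index names are illustrative, and whether they pay off depends entirely on your data):

-- hypothetical partial indexes so each correlated subquery can stop at one row
create index items_f_key_data1_idx on items (f_key, fit desc) where data1 is not null;
create index items_f_key_data2_idx on items (f_key, fit desc) where data2 is not null;

-- then compare plans with and without them
explain analyze
select f_key,
       (select data1 from items i2
        where i2.f_key = i.f_key and i2.data1 is not null
        order by fit desc limit 1) as data1
from items i
group by f_key;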
