
I have a SQL table in Postgres 14 that looks something like this:

f_key  data1         data2   fit
1      {'a1', 'a2'}  null    3
1      {'b1', 'b2'}  {'b3'}  2
2      {'c1', 'c2'}  null    3

Note that data1 and data2 are arrays.

I need to query this so that I will get data1 and data2 that have the best (highest) fit, but are not null (where possible), grouped by f_key.

So the result would look like this:

f_key  data1         data2
1      {'a1', 'a2'}  {'b3'}
2      {'c1', 'c2'}  null

My current approach is to use array_agg in this fashion:

select 
    tt.f_key,
    (array_agg(tt.data1) filter (where tt.data1 is not null))[1] as d1,
    (array_agg(tt.data2) filter (where tt.data2 is not null))[1] as d2
from
   (select * from items order by f_key, fit desc) as tt
group by f_key;

but d1 and d2 return null in this case. What leaves me completely puzzled is that:

  • (array_agg(tt.data1) filter (where tt.data1 is not null)) as d1 returns an array of arrays, as expected
  • (array_agg(tt.data1) filter (where tt.data1 is not null))[1][1] as d1 returns the first element of the first sub-array, as expected

How do I retrieve the first sub-array from the result of array_agg?

  • Use slice syntax. Also, you can't omit any subscripts of a multidimensional array; that's what gets you that null: "an array reference with the wrong number of subscripts yields a null rather than an error."
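
To see that in isolation, a minimal demo (array literal made up for the example):

select ('{{a1,a2},{b1,b2}}'::text[])[1];     -- null: one subscript short for a 2-D array
select ('{{a1,a2},{b1,b2}}'::text[])[1][1];  -- a1: correct number of subscripts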

3 Answers


get data1 and data2 that have the best (highest) fit, but are not null (where possible), grouped by f_key.

demo at db<>fiddle

select distinct f_key,
       first_value(data1) over (w1 order by data1 is null, fit desc) as d1,
       first_value(data2) over (w1 order by data2 is null, fit desc) as d2
from items
window w1 as (partition by f_key);
f_key  d1       d2
1      {a1,a2}  {b3}
2      {c1,c2}  null

This emulates an ignore nulls clause, or filter (where dataX is not null), by sorting the nulls behind everything else and then grabbing first_value(). Note that false sorts lower than true, so with the default ascending order the rows where dataX is null end up last.

Since that's a window function, not an aggregate, distinct has to deduplicate results.
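
To see the sorting trick in isolation (values made up for the demo):

select x
from (values ('a'), (null), ('b')) as t(x)
order by x is null, x;
-- a, b, null: only the null row makes "x is null" true, so it sorts last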


DISTINCT ON

with cte1 as (
  select distinct on (f_key) f_key, data1 as top1
  from items
  where data1 is not null
  order by f_key, fit desc),
cte2 as (
  select distinct on (f_key) f_key, data2 as top2
  from items
  where data2 is not null
  order by f_key, fit desc)
select distinct f_key, top1, top2
from items
left join cte1 using (f_key)
left join cte2 using (f_key);

It grabs the whole row coinciding with each top fit value per f_key. It's possibly more efficient. To really discuss performance, we'd at least have to know the volume and characteristics of your data sets and what indexes you have in place.


Slice syntax arr[1][:]

How do I retrieve the first sub-array from the result of array_agg?

select f_key,
       (array_agg(data1 order by fit desc) filter (where data1 is not null))[1][:] as d1,
       (array_agg(data2 order by fit desc) filter (where data2 is not null))[1][:] as d2
from items
group by f_key;
f_key  d1         d2
1      {{a1,a2}}  {{b3}}
2      {{c1,c2}}  null

In Postgres, omitting subscripts of a multidimensional array gets you a null:

an array reference with the wrong number of subscripts yields a null rather than an error.

Note that if there's even one slice in use, all other subscripts become slices, meaning that [1][:] is the same as [1:1][:]; but more importantly, [2][:] becomes [1:2][:], not [2:2][:] as you might expect.

If any dimension is written as a slice, i.e., contains a colon, then all dimensions are treated as slices. Any dimension that has only a single number (no colon) is treated as being from 1 to the number specified.
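
A quick demonstration with a made-up literal:

select ('{{1,2},{3,4},{5,6}}'::int[])[2][:];    -- {{1,2},{3,4}}, i.e. [1:2][:]
select ('{{1,2},{3,4},{5,6}}'::int[])[2:2][:];  -- {{3,4}}, the single row you probably meant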

Also, an order by in a subquery will typically give you an ordered array, but I'm not sure it's guaranteed to. Luckily, aggregate functions offer an internal order by.

It's worth underlining that a slice keeps the dimensionality of its source array. Since there's nothing else in there in your case, you can strip that one dimension with array(select unnest(arr)): it spits out all atomic elements, then re-collects them into a 1-D array. Here's a whole thread just about that one topic of unwrapping Postgres arrays.
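
For example:

select array(select unnest('{{a1,a2}}'::text[]));  -- {a1,a2}, back to one dimension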


Custom aggregates

What you're trying to do is effectively a vertical coalesce():

select f_key,
       coalesce_agg(data1 order by fit desc) d1,
       coalesce_agg(data2 order by fit desc) d2
from items
group by f_key;

You can create an aggregate that does exactly that:

create function coalesce_agg_sfunc(anycompatible, anycompatible)
returns anycompatible as 'select coalesce($1, $2)' language sql;

create aggregate coalesce_agg(anycompatible) (sfunc = coalesce_agg_sfunc,
                                              stype = anycompatible);
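
A quick sanity check (sample values made up for the demo): since the state function keeps the first non-null value it sees, the aggregate returns the first non-null input in the requested order.

select coalesce_agg(x order by ord) as first_non_null
from (values (1, null::int), (2, 5), (3, 7)) as t(ord, x);
-- returns 5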

If you were on PostgreSQL version 16 or higher, this test seems to suggest the built-in any_value() happens to act exactly the same when given an internal order by. Problem is, the maintainers are free to make it ignore that clause entirely in the future, which would make sense as an optimisation, same as for count(), sum(), avg() or any other commutative, order-insensitive aggregate.
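
For completeness, the 16+ equivalent would look like this; treat it as a sketch, since that ordered-input behaviour is an observation, not a documented contract:

select f_key,
       any_value(data1 order by fit desc) as d1,  -- relies on unspecified behaviour
       any_value(data2 order by fit desc) as d2
from items
group by f_key;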


The v14 you're on is currently the oldest supported version. Please plan an upgrade.


5 Comments

I found the DISTINCT ON approach interesting. For example, this option (dbfiddle.uk/qxEn_rgL)
Looks like we tested that at about the same time but your version overlooks f_keys with both data1 and data2 holding null values. There has to be another join to distinct/aggregated f_keys: dbfiddle.uk/p7ZfwaQo It's underspecified whether that's possible, and if not, your direct join is shorter and better.
Perhaps this option looks more reliable.
The slice syntax arr[1][:] almost works as I want it, but it still returns an array of length-one arrays, as shown in the results you provided. On the other hand, using array(select unnest(arr)) gives me what I need, but it seems like a lot of operations just to get the first element of an array.
I agree Postgres arrays are a bit of an acquired taste. If you edit the question to add some context regarding the volume, some characteristics and the current indexing of your data sets, it would be easier to pick the right method based on performance. Readability/clarity wise, I like the any_value()/coalesce_agg() added at the end.

With "array_agg(data_array1)" you get 2-dimensional array.

With test data

f_key  data1        data2   fit
1      {'a1','a2'}  null    3
1      {'b1','b2'}  {'b3'}  2
2      {'c1','c2'}  null    3
select *,
       array_dims(d1) as d_d1,
       array_dims(d2) as d_d2
from (
  select tt.f_key,
         array_agg(tt.data1) filter (where tt.data1 is not null) as d1,
         array_agg(tt.data2) filter (where tt.data2 is not null) as d2
  from (select * from items order by f_key, fit desc) as tt
  group by f_key
) t;
f_key  d1                         d2        d_d1        d_d2
1      {{'a1','a2'},{'b1','b2'}}  {{'b3'}}  [1:2][1:2]  [1:1][1:1]
2      {{'c1','c2'}}              null      [1:1][1:2]  null

If you want to get a 1-dimensional array, see the example below.
Here, the ordering by data1 or data2 matters less to us than the nulls last condition:
we try to take the first non-null value, if possible.

select *,
       array_dims(data1) as dim_data1,
       array_dims(data2) as dim_data2
from (
  select f_key,
         min(case when rn1 = 1 then data1 end) as data1,
         min(case when rn2 = 1 then data2 end) as data2
  from (
    select *,
           row_number() over (partition by f_key
               order by case when data1 is not null then 1 else 0 end desc,
                        fit desc, data1 nulls last) as rn1,
           row_number() over (partition by f_key
               order by case when data2 is not null then 1 else 0 end desc,
                        fit desc, data2 nulls last) as rn2
    from items as tt
  ) t
  group by f_key
) t2;
f_key  data1        data2   dim_data1  dim_data2
1      {'a1','a2'}  {'b3'}  [1:2]      [1:1]
2      {'c1','c2'}  null    [1:2]      null

fiddle

Comments


As you don't need to inspect the arrays and just want the best fit, you can skip all the window functions and array handling (although they were worth the detailed answers of both Zegarek and ValNik; don't skip reading them!) and simply rely on a good old limit 1 correlated subquery:
select f_key, (select dataN from items i2 where i2.f_key = i.f_key order by dataN is null, fit desc limit 1) dataN, … group by f_key;

And as noted by Zegarek, this can further be optimized by filtering out the row(s) with a fit but no data (which would have returned null) before even sorting them, as a subquery naturally returns null when it has no rows:

select
  f_key,
  (select data1 from items i2 where i2.f_key = i.f_key and data1 is not null order by fit desc limit 1) data1,
  (select data2 from items i2 where i2.f_key = i.f_key and data2 is not null order by fit desc limit 1) data2
from items i
group by f_key;
f_key  data1    data2
1      {a1,a2}  {b3}
2      {c1,c2}  null

(as shown in this db<>fiddle)

Note that this more or less corresponds to Zegarek's (PostgreSQL-specific) distinct on, in that you avoid intermediate array_agg()s and array popping, letting PostgreSQL just walk through scalar values as quickly as it can.

5 Comments

I think in this case adding and i2.dataX is not null to the where would make a bit more sense. The reason I'm using the order by trick to eliminate those is that first_value() is a non-aggregate window function, so it doesn't support a filter (where ...) clause, and Postgres doesn't offer an ignore nulls clause in general. If you have access to a regular where, you can be fairly certain it'll work better, on top of being simpler, clearer and easier to follow. dbfiddle.uk/iLekoCbi
Thanks, yes, this was an obvious simplification and optimization that I missed with my brain in the mists of covid.
Sad to hear that, I hope you're getting better.
It is an idea, and it produces the desired results. I'm only afraid that the (nearly identical) subqueries will have a negative effect on performance.
Depending on the number of dataN columns you have to handle, and on your data (if you have a lot of null holes you could even afford partial indexes on (f_key) where data1 is not null), you could be pleasantly surprised by letting PostgreSQL optimize those (apparently) multiple scalar walk-throughs with their limit 1, versus the server's in-place aggregation before fetching only one element. Do give it a try with an explain analyze; see the sketch below.
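
A rough sketch of that suggestion (index names are illustrative, and whether they pay off depends entirely on your data):

-- hypothetical partial indexes so each correlated subquery can stop at one row
create index items_f_key_data1_idx on items (f_key, fit desc) where data1 is not null;
create index items_f_key_data2_idx on items (f_key, fit desc) where data2 is not null;

-- then compare plans with and without them
explain analyze
select f_key,
       (select data1 from items i2
        where i2.f_key = i.f_key and i2.data1 is not null
        order by fit desc limit 1) as data1
from items i
group by f_key;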
