SQL - efficient way to aggregate boolean values (postgresql)

Question

Let's assume a table with 3 columns (originally it's a big table): id, is_deleted, date. I have to check whether the given ids are deleted and create a new column with this value (TRUE or FALSE). Let's simplify it to the table below (before):

id	is_deleted	date
A	False	03-07-2022
A	True	04-07-2022
B	False	05-07-2022
B	False	06-07-2022
C	True	07-07-2022

(after):

id	is_deleted	date	deleted
A	False	03-07-2022	TRUE
A	True	04-07-2022	TRUE
B	False	05-07-2022	FALSE
B	False	06-07-2022	FALSE
C	True	07-07-2022	TRUE

So we can see that rows with ids A and C should have TRUE in the new column. A given id can have more than one TRUE value in the is_deleted column. If an id has at least one TRUE value, all rows with that id should be marked as deleted (TRUE in the new column). I need to do this inside this table, without GROUP BY, because with GROUP BY I would have to create another CTE to join with, which complicates the query and hurts performance.

I just want to create a single column inside this table with the new deleted value.

I've found the bool_or function, but it won't work as a window function in Redshift. My code:

bool_or(is_deleted) over(partition by id) as is_del

I can't use the max or sum functions on booleans. Casting bool to int worsens the performance. Is there any other way to do it using booleans and keep good performance?

Thank you.

As for the formatting of the table, for some reason Stack Overflow shows it working fine in the preview, but unless you have a blank line before and after the table, it will show up as a garbled mess when you submit. I've edited your question to add that blank line. Hope Stack Overflow fixes that one soon. It's been broken since they introduced table markup.
– JNevill, Commented Jan 23, 2023 at 22:59
Would both rows of A have an is_del value of True, or just the one row with is_deleted = True? It's not clear to me. Perhaps sharing the desired results after this operation is complete would help clarify.
– JNevill, Commented Jan 23, 2023 at 23:03
Yes, both can have TRUE. If there's one or more TRUE value for a given id, it should be deleted.
– Joe, Commented Jan 23, 2023 at 23:07
The documentation for the MAX window function states "Accepts any data type as input. Returns the same data type as expression.". See docs.aws.amazon.com/redshift/latest/dg/r_WF_MAX.html Are you saying that the documentation is incorrect?
– Bill Weiner, Commented Jan 24, 2023 at 15:38

Lukasz Szozda · Accepted Answer · 2023-01-26 17:41:36Z

5

+25

It should be possible to emulate such behaviour with MIN/MAX functions and explicit casting:

SELECT MAX(is_deleted::INT) OVER (PARTITION BY id)
FROM ...;
-- if all is_deleted are false, then result is 0, 1 otherwise

If the result should be boolean, then: MAX(is_deleted::INT) OVER (PARTITION BY id) = 1 or ( MAX(is_deleted::INT) OVER (PARTITION BY id))::BOOLEAN
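
For reference, here is a self-contained sketch of this approach run against the question's sample rows. The CTE name sample_rows is made up for illustration, and the dates are assumed to be in DD-MM-YYYY order. It is written for PostgreSQL; on Redshift only the window expression is meant to carry over.

```sql
-- sample_rows is a hypothetical stand-in for the real table
WITH sample_rows (id, is_deleted, date) AS (
    VALUES
        ('A', FALSE, DATE '2022-07-03'),
        ('A', TRUE,  DATE '2022-07-04'),
        ('B', FALSE, DATE '2022-07-05'),
        ('B', FALSE, DATE '2022-07-06'),
        ('C', TRUE,  DATE '2022-07-07')
)
SELECT
    id,
    is_deleted,
    date,
    -- MAX over the partition is 1 if any row of this id is deleted, 0 otherwise;
    -- comparing with 1 turns the integer back into a boolean
    MAX(is_deleted::INT) OVER (PARTITION BY id) = 1 AS deleted
FROM sample_rows
ORDER BY id, date;
```

Every row of A and C should come back with deleted = true and every row of B with deleted = false. On PostgreSQL itself, bool_or(is_deleted) OVER (PARTITION BY id) should produce the same column without any cast, so the cast is only a workaround for engines that reject bool_or in a window.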

answered Jan 26, 2023 at 17:41

Lukasz Szozda


2 Comments

Joe Over a year ago

That's how I did it. Is double casting a good approach in such a case, or is there a better-performing option?

Lukasz Szozda Over a year ago

@Joe I would not expect significant performance implications

SebCza · Accepted Answer · 2023-01-27 10:14:33Z

1

Here are 2 different ways you could check:

1. With EXISTS, which works very well on a highly redundant table

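SELECT
    id
    , is_deleted
    , date
    -- COALESCE is used instead of NVL, and the Oracle-only "FROM dual" is dropped,
    -- so the query also runs on PostgreSQL
    , COALESCE((SELECT 'TRUE' WHERE EXISTS (SELECT 1 FROM yourtabletable yt2 WHERE 
        yt2.id = yt1.id 
            AND yt2.is_deleted = 'True')
    ), 'FALSE') deleted
FROM 
    yourtabletable yt1;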

2. With WITH, where you could use hints like /*+ materialize */

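WITH tmp AS(
    -- DISTINCT keeps one row per id, so the scalar subquery below cannot return several rows
    SELECT /*+ materialize */ DISTINCT id, 'TRUE' deleted FROM yourtabletable WHERE is_deleted = 'True'
)

SELECT
    id
    , is_deleted
    , date
    -- tmp only holds ids that have a deleted row, so matching on id is enough
    , COALESCE((SELECT deleted FROM tmp yt2 WHERE 
        yt2.id = yt1.id
    ), 'FALSE') deleted
FROM 
    yourtabletable yt1;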
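
A shorter variant of the first option, as a sketch: in PostgreSQL an EXISTS predicate already evaluates to a boolean, so it can be selected directly and the 'TRUE'/'FALSE' strings are not needed. The table name yourtabletable is the placeholder used in this answer, and the alias deleted is the column the question asks for.

```sql
SELECT
    yt1.id,
    yt1.is_deleted,
    yt1.date,
    -- TRUE when at least one row with the same id has is_deleted set
    EXISTS (
        SELECT 1
        FROM yourtabletable yt2
        WHERE yt2.id = yt1.id
          AND yt2.is_deleted
    ) AS deleted
FROM yourtabletable yt1;
```

This returns exactly one output row per input row, and deleted comes back as a real boolean rather than text. It is written for PostgreSQL; whether Redshift accepts a correlated subquery in this position would need to be checked.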

edited Jan 27, 2023 at 10:14

answered Jan 27, 2023 at 10:13

SebCza


Booboo · Accepted Answer · 2023-01-29 12:31:18Z

If I understand the problem, then I would think that for each unique id value you should be looking at the is_deleted value that has the latest (maximum) date value. In this way even though there may be a row where is_deleted is true, if there is another row for the same id value with a later date that has is_deleted as false, then false should be the final status. If this isn't how the new deleted column should be computed, then just ignore this answer, please.

Schema (PostgreSQL v15)

CREATE TABLE Table1
    ("id" varchar(1), "is_deleted" bool, "date" timestamp)
;
    
INSERT INTO Table1
    ("id", "is_deleted", "date")
VALUES
    ('A', False, '2022-03-07 00:00:00'),
    ('A', True, '2022-04-07 00:00:00'),
    ('A', True, '2022-04-09 00:00:00'), /* another True row for A */
    ('B', False, '2022-05-07 00:00:00'),
    ('B', False, '2022-06-07 00:00:00'),
    ('C', True, '2022-07-07 00:00:00')
;

Query #1

with latest_is_deleted as (
    -- keep only the most recent row per id
    select t.* from
        (select t.id, t.is_deleted as deleted, row_number() over (partition by id order by date desc) as seqnum
            from Table1 t
         ) t
    where seqnum = 1
)

select t.*, l.deleted from
Table1 t join latest_is_deleted l on t.id = l.id;

id	is_deleted	date	deleted
A	false	2022-03-07T00:00:00.000Z	true
A	true	2022-04-07T00:00:00.000Z	true
A	true	2022-04-09T00:00:00.000Z	true
B	false	2022-05-07T00:00:00.000Z	false
B	false	2022-06-07T00:00:00.000Z	false
C	true	2022-07-07T00:00:00.000Z	true
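
If the latest-row rule is what is wanted, the join can probably be avoided with a single window function, which is closer to what the question asks for. The following is a sketch against the same Table1. It assumes dates are unique within an id, since with ties the row that gets picked would be arbitrary.

```sql
SELECT
    t.*,
    -- is_deleted of the row with the latest date for this id;
    -- the explicit frame makes every row in the partition see the same first row
    FIRST_VALUE(t.is_deleted) OVER (
        PARTITION BY t.id
        ORDER BY t.date DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS deleted
FROM Table1 t
ORDER BY t.id, t.date;
```

On the sample data above this should give the same deleted values as Query #1: true for A and C, false for B.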


PRATHAMESH JOSHI · Accepted Answer · 2023-01-31 13:48:38Z

This is one approach that returns all records with their respective deleted column values.

select a.*,
       case when b.id is not null then 'TRUE' else 'FALSE' end as deleted
from table1 a
left join (select distinct id from table1 where is_deleted is true) b
       on (a.id = b.id)
order by 1, 3;

I have created a sample schema here: https://www.db-fiddle.com/f/4k32Eb1t2DSUQ6FkzKBMXi/0 Feel free to customize it with your data.

CREATE TABLE Table1
("id" varchar(1), "is_deleted" bool, "date" timestamp);

INSERT INTO Table1
    ("id", "is_deleted", "date")
VALUES
    ('A', False, '2022-03-07 00:00:00'),
    ('A', True, '2022-04-07 00:00:00'),
    ('A', True, '2022-04-09 00:00:00'), /* another True row for A */
    ('B', False, '2022-05-07 00:00:00'),
    ('B', False, '2022-06-07 00:00:00'),
    ('C', True, '2022-07-07 00:00:00')
;
INSERT INTO Table1
    ("id", "is_deleted", "date")
VALUES
    ('D', False, '2022-03-07 00:00:00'),
    ('D', false, '2022-04-06 00:00:00');
    
INSERT INTO Table1
    ("id", "is_deleted", "date")
VALUES
    ('C', False, '2022-03-07 00:00:00');

Trung Duong · Accepted Answer · 2023-01-31 15:41:59Z

0

In your case, I think using UNION ALL of 2 subqueries could yield better performance than using window functions, especially if your table has an index on the id and is_deleted columns.

SELECT 
  d1.*,
  TRUE AS deleted
FROM <your table> d1
WHERE EXISTS (SELECT 1 
              FROM <your table> d2
              WHERE d1.id = d2.id AND is_deleted)
UNION ALL 
SELECT 
  d1.*,
  FALSE AS deleted
FROM <your table> d1
WHERE NOT EXISTS (SELECT 1 
              FROM <your table> d2
              WHERE d1.id = d2.id AND is_deleted);


answered Jan 31, 2023 at 15:41

Trung Duong


Luuk · Accepted Answer · 2023-01-27 17:49:14Z

-1

This select statement should give the needed output:

select
   yt1.id,  
   yt1.is_deleted,
   yt1.date,
   case when yt2.is_deleted then true else false end as deleted
from yourtabletable yt1
left join yourtabletable yt2 on yt2.id = yt1.id and yt2.is_deleted

edited Jan 27, 2023 at 17:49

answered Jan 26, 2023 at 17:57

Luuk


7 Comments

Booboo Over a year ago

In addition to a missing comma and ambiguous column names, if there were, for example, another row with values ('A', False, '2022-03-09 00:00:00'), /* another False row for A */, then you would be returning duplicate rows.

Luuk Over a year ago

oops I corrected the ambiguous names, and added the (missing) comma.

Booboo Over a year ago

See this demo of the third issue, which might be a possibility though the data the OP shows is not clear on that issue. But I wouldn't make any assumptions.

Luuk Over a year ago

Adding DISTINCT solves that. But I chose not to add that to the statement in my answer, because it's unknown whether that can happen in the problem as asked.

Booboo Over a year ago

So I posted a question to the OP asking whether it is possible to have such a row, which is better than hiding one's head in the sand.
