Remove duplicate rows based on specific columns

Question

I have a table that contains these columns:

ID (varchar)
SETUP_ID (varchar)
MENU (varchar)
LABEL (varchar)

The thing I want to achieve is to remove all duplicates from the table based on two columns (SETUP_ID, MENU).

Table I have:

id  |  setup_id  |  menu  |  label  |
-------------------------------------
1   |    10      |  main  |  txt    |
2   |    10      |  main  |  txt    |
3   |    11      | second |  txt    |
4   |    11      | second |  txt    |
5   |    12      | third  |  txt    |

Table I want:

id  |  setup_id  |  menu  |  label  |
-------------------------------------
1   |    10      |  main  |  txt    |
3   |    11      | second |  txt    |
5   |    12      | third  |  txt    |

This does not answer the question. It offers a solution to create a new table. — Gurmokh
– Gurmokh, Commented Dec 22, 2021 at 11:48
The task is not fully clear, while we don't know whether setup_idor menu can be NULL. And how to deal with NULL values if possible. Also, please always declare your version of Postgres, and tell us whether performance matters. If so, roughly how many rows in the table, roughly how many dupes? Finally, do you just want to select distinct rows, or actively delete dupes from the table? — Erwin Brandstetter
– Erwin Brandstetter, Commented Dec 22, 2021 at 15:27

Gurmokh · Accepted Answer · 2021-12-22 12:20:56Z

2

You can achieve this with a common table expression (cte)

with cte as ( 
           select id, setup_id, menu, 
                  row_number () over (partition by setup_id, menu, label) rownum
           from atable )
delete from atable a
where id in (select id from cte where rownum >= 2)

This will give you your desired output.

Common Table Expression docs

edited Dec 22, 2021 at 12:20

answered Dec 22, 2021 at 11:56

Gurmokh

2,09127 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Costi Ciudatu Over a year ago

I think the label should not be included in the window definition, if I understand the OP's requirement correctly. Also, rownum = 2 assumes that there can be at most one duplicate row for a given (setup_id, menu) tuple. To only keep one row when it's duplicated (however many times), you should use where rownum >= 2.

Gurmokh Over a year ago

Good spot, updated to remove label.

Prabhu Nandan Kumar Over a year ago

Writing subquery is a bad idea in the performance point of view.

Erwin Brandstetter · Accepted Answer · 2021-12-22 15:52:26Z

Assuming a table named tbl where both setup_id and menu are defined NOT NULL and id is the PRIMARY KEY.
EXISTS will do nicely:

DELETE FROM tbl t0
WHERE  EXISTS (
   SELECT FROM tbl t1
   WHERE  t1.setup_id = t0.setup_id
   AND    t1.menu = t0.menu
   AND    t1.id < t0.id
   );

This deletes every row where a dupe with lower id is found, effectively only keeping the row with the smallest id from each set of dupes. An index on (setup_id, menu) or even (setup_id, menu, id) will help performance with big tables a lot.

If there is no PK and no reliable UNIQUE (combination of) column(s), you can fall back to using the ctid. If NULL values can be involved, you need to specify how to deal with those.
Consider:

After cleaning up duplicates, add a UNIQUE constraint to prevent new dupes:

ALTER TABLE tbl ADD CONSTRAINT tbl_setup_id_menu_uni UNIQUE (setup_id, menu);

If you had an index on (setup_id, menu), drop that now. It's superseded by the UNIQUE constraint.

Costi Ciudatu · Accepted Answer · 2021-12-22 12:16:26Z

1

I have found a solution that fits me the best. Here it is if anyone needs it:

DELETE FROM table_name
WHERE id IN
  (SELECT id
   FROM
       (SELECT id,
               ROW_NUMBER() OVER( PARTITION BY setup_id,
     menu
    ORDER BY  id ) AS row_num
        FROM table_name ) t
   WHERE t.row_num > 1 );

edited Dec 22, 2021 at 12:16

Costi Ciudatu

38.5k7 gold badges58 silver badges95 bronze badges

answered Dec 22, 2021 at 12:06

geralt93

731 silver badge7 bronze badges

2 Comments

Erwin Brandstetter Over a year ago

Vaild. But (like some other proposed solutions here) much more expensive than necessary.

Jakob Over a year ago

In RedShift, this approach worked rather than the (perhaps more intuitive) approach @VynlJunkie provided earlier.

jian · Accepted Answer · 2021-12-22 15:23:11Z

link: https://www.postgresql.org/docs/current/queries-union.html
https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT
let's sat table name is a

select distinct on (setup_id,menu ) a.* from a;

Key point: The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

Which means you can only order by setup_id,menu in this distinct on query scope.

Want the opposite:
EXCEPT returns all rows that are in the result of query1 but not in the result of query2. (This is sometimes called the difference between two queries.) Again, duplicates are eliminated unless EXCEPT ALL is used.

SELECT * FROM a
EXCEPT 
select distinct on (setup_id,menu ) a.* from a;

Costi Ciudatu · Accepted Answer · 2021-12-22 11:52:18Z

0

You can try something along these lines to delete all but the first row in case of duplicates (please note that this is not tested in any way!):

DELETE FROM your_table WHERE id IN (
    SELECT unnest(duplicate_ids[2:]) FROM (
        SELECT array_agg(id) AS duplicate_ids FROM your_table
            GROUP BY SETUP_ID, MENU
            HAVING COUNT(*) > 1
        )
    )
)

The above collects the ids of the duplicate rows (COUNT(*) > 1) in an array (array_agg), then takes all but the first element in that array ([2:]) and "explodes" the id values into rows (unnest). The outer query just deletes every id that ends up in that result.

answered Dec 22, 2021 at 11:52

Costi Ciudatu

38.5k7 gold badges58 silver badges95 bronze badges

Comments

Imran Zahoor · Accepted Answer · 2021-12-22 11:58:46Z

0

For mysql the similar question is already answered here Find and remove duplicate rows by two columns

Try if any of the approach helps in this matter.

I like the below one for MySql:

ALTER IGNORE TABLE your_table ADD UNIQUE (SETUP_ID, MENU);

edited Dec 22, 2021 at 11:58

answered Dec 22, 2021 at 11:20

Imran Zahoor

2,8371 gold badge34 silver badges42 bronze badges

1 Comment

geralt93 Over a year ago

Im sorry...I dont know what was I thinking when I wrote Mysql...I meant postgresql for which this solution won't work.

Dharman · Accepted Answer · 2021-12-22 12:00:11Z

0

DELETE t1 
FROM table_name t1
    join table_name t2 on
    (t2.setup_id = t1.setup_id or t2.menu = t1.menu) and t2.id < t1.id

edited Dec 22, 2021 at 12:00

Dharman♦

33.9k27 gold badges106 silver badges157 bronze badges

answered Dec 22, 2021 at 11:54

Kodoku

13 bronze badges

3 Comments

Costi Ciudatu Over a year ago

I think the first condition should read (t2.setup_id = t1.setup_id AND t2.menu = t1.menu) -- i.e. AND instead of OR.

Kodoku Over a year ago

ah ok exactly AND, because i'm not very fluent in english, so i don't understand the problem

Erwin Brandstetter Over a year ago

This is not legal Postgres syntax.

Prabhu Nandan Kumar · Accepted Answer · 2021-12-22 12:21:19Z

0

There are many ways to find and delete all duplicate row(s) based on conditions. But I like inner join method, which works very fast even in a large amount of Data. Please check follows :

DELETE T1 FROM <TableName> T1
INNER JOIN <TableName> T2 
WHERE
    T1.id > T2.id AND 
    T1.<ColumnName1> = T2.<ColumnName1> AND T1.<ColumnName2> = T2.<ColumnName2>;

In your case you can write as follows :

DELETE T1 FROM <TableName> T1
    INNER JOIN <TableName> T2 
    WHERE
        T1.id > T2.id AND 
        T1.setup_id = T2. setup_id;

Let me know if you face any issue or need more help.

answered Dec 22, 2021 at 12:21

Prabhu Nandan Kumar

1,26513 silver badges22 bronze badges

1 Comment

Erwin Brandstetter Over a year ago

Neither is legal Postgres syntax.

Collectives™ on Stack Overflow

Remove duplicate rows based on specific columns

8 Answers 8

3 Comments

Comments

2 Comments

Comments

Comments

1 Comment

3 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

8 Answers 8

3 Comments

Comments

2 Comments

Comments

Comments

1 Comment

3 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related