
Here is my sql data

id      location1    location2      distance
--------------------------------------------
1       Paris        Marseille      150km
2       Paris        Lyon           200km
3       Paris        Strasbourg     300km
4       Paris        Toulouse       350km
5       Marseille    Paris          150km  <-(almost) duplicate of row 1
6       Marseille    Lyon           250km
...

Because the distance from Paris -> Marseille equals the distance from Marseille -> Paris, I want to remove one of the duplicated rows.

The table contains almost 1M rows, and half of them are duplicates. How can I remove these duplicates from such a large table?

  • Which one do you want to remove? Commented Nov 28, 2014 at 17:23
  • It doesn't matter. Whichever is easier to remove is fine. Commented Nov 28, 2014 at 17:25
  • Deleting 500K rows in a single query is very demanding for the server. Are you OK with using cursors? Commented Nov 28, 2014 at 17:29
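Following up on the comment about deleting 500K rows in one statement: a common alternative to cursors is to collect the duplicate ids first and delete them in small batches, each in its own transaction. Here is a minimal sketch using Python's sqlite3 as a stand-in database (the table name `city` and the batch size are assumptions, not from the question):

```python
import sqlite3

# In-memory stand-in for the real table; data taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    location1 TEXT NOT NULL,
    location2 TEXT NOT NULL,
    distance INTEGER NOT NULL)""")
conn.executemany("INSERT INTO city VALUES (?,?,?,?)", [
    (1, 'Paris', 'Marseille', 150),
    (2, 'Paris', 'Lyon', 200),
    (3, 'Paris', 'Strasbourg', 300),
    (4, 'Paris', 'Toulouse', 350),
    (5, 'Marseille', 'Paris', 150),
    (6, 'Marseille', 'Lyon', 250),
])

# Self-join to find the reversed duplicates; keep the lower id,
# collect the higher one for deletion.
dup_ids = [row[0] for row in conn.execute("""
    SELECT c2.id
      FROM city c1
      JOIN city c2
        ON c2.location1 = c1.location2
       AND c2.location2 = c1.location1
       AND c2.id > c1.id""")]

BATCH = 1000  # tune on the real 1M-row table
for i in range(0, len(dup_ids), BATCH):
    batch = dup_ids[i:i + BATCH]
    conn.execute(
        "DELETE FROM city WHERE id IN (%s)" % ",".join("?" * len(batch)),
        batch)
    conn.commit()  # commit per batch to keep transactions small

print(sorted(row[0] for row in conn.execute("SELECT id FROM city")))
# → [1, 2, 3, 4, 6]
```

Committing per batch keeps each transaction (and its lock footprint) small, which is the usual reason to avoid a single 500K-row delete.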

4 Answers


This is a situation where you can join the table with itself:

DELETE FROM city WHERE id IN (
  SELECT c1.id
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id)

I assumed your table is named city.

As noted by miszyman, it is more efficient to avoid a subquery:

  DELETE c1
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id

2 Comments

You could also easily transform this into a delete.
You could also do it without the sub-query. If the table is big, execution with a sub-query is a lot slower, as the sub-query is executed for each row.

If every distance appears twice in your database, you can achieve this easily by just selecting the rows where location1 < location2.
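A minimal sketch of this idea, using Python's sqlite3 as a stand-in database (the table name `city` is an assumption). It only works under the answer's premise that every pair is stored in both directions; a one-directional row with location1 > location2, or a row with location1 = location2, would be deleted with nothing kept in its place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    location1 TEXT, location2 TEXT, distance INTEGER)""")
# Every pair stored in both directions, as the answer assumes.
conn.executemany("INSERT INTO city VALUES (?,?,?,?)", [
    (1, 'Paris', 'Marseille', 150),
    (5, 'Marseille', 'Paris', 150),
    (2, 'Paris', 'Lyon', 200),
    (7, 'Lyon', 'Paris', 200),
])

# Keep only the direction where location1 sorts before location2:
# exactly one row per pair survives, no self-join needed.
conn.execute("DELETE FROM city WHERE location1 >= location2")
conn.commit()

rows = list(conn.execute(
    "SELECT location1, location2 FROM city ORDER BY id"))
print(rows)
# → [('Marseille', 'Paris'), ('Lyon', 'Paris')]
```

The appeal of this form is that the delete needs no join at all, only a per-row comparison of the two columns.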


DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(id      INT NOT NULL
,location1    varchar(20) not null
,location2      varchar(20) not null
,distance INT NOT NULL
,UNIQUE(location1,location2)
);

INSERT INTO my_table VALUES
(1 ,'Paris','Marseille',150),
(2 ,'Paris','Lyon',200),
(3 ,'Paris','Strasbourg',300),
(4 ,'Paris','Toulouse',350),
(5 ,'Marseille','Paris',150),
(6 ,'Marseille','Lyon',250);

DELETE x 
  FROM my_table x 
  JOIN my_table y 
    ON y.location2 = x.location1 
   AND y.location1 = x.location2 
   AND y.distance = x.distance 
   AND y.id < x.id;
Query OK, 1 row affected (0.00 sec)

SELECT * 
  FROM my_table;
+----+-----------+------------+----------+
| id | location1 | location2  | distance |
+----+-----------+------------+----------+
|  6 | Marseille | Lyon       |      250 |
|  2 | Paris     | Lyon       |      200 |
|  1 | Paris     | Marseille  |      150 |
|  3 | Paris     | Strasbourg |      300 |
|  4 | Paris     | Toulouse   |      350 |
+----+-----------+------------+----------+



If half (or nearly half) of the rows are duplicates, I would go with creating a temporary table and re-inserting the data:

create temporary table tempt as
    select location1, location2, distance
    from mydata t
    where location1 < location2
    union all
    select location1, location2, distance
    from mydata t
    where location1 > location2 and
          not exists (select 1 from mydata t2 where t2.location1 = t.location2 and t2.location2 = t.location1);

truncate table mydata;

insert into mydata(location1, location2, distance)
    select location1, location2, distance
    from tempt;

For performance, you want an index on mydata(location1, location2):

create index idx_mydata_location1_location2 on mydata(location1, location2)
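A runnable sketch of this rebuild, using Python's sqlite3 (table and column names taken from the answer; SQLite has no TRUNCATE, so DELETE FROM stands in for it). The second UNION branch keeps one-directional rows whose reversed pair is missing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mydata (location1 TEXT, location2 TEXT, distance INTEGER);
INSERT INTO mydata VALUES
  ('Paris', 'Marseille', 150),
  ('Marseille', 'Paris', 150),   -- reversed duplicate: drop one copy
  ('Marseille', 'Lyon', 250);    -- one-directional: must survive

-- Keep rows with location1 < location2, plus rows stored only in the
-- "wrong" direction (no reversed counterpart exists).
CREATE TEMPORARY TABLE tempt AS
    SELECT location1, location2, distance
    FROM mydata t
    WHERE location1 < location2
    UNION ALL
    SELECT location1, location2, distance
    FROM mydata t
    WHERE location1 > location2
      AND NOT EXISTS (SELECT 1 FROM mydata t2
                      WHERE t2.location1 = t.location2
                        AND t2.location2 = t.location1);

DELETE FROM mydata;  -- stand-in for TRUNCATE TABLE mydata

INSERT INTO mydata SELECT * FROM tempt;
""")

rows = sorted(conn.execute("SELECT location1, location2 FROM mydata"))
print(rows)
# → [('Marseille', 'Lyon'), ('Marseille', 'Paris')]
```

Rebuilding into a fresh table is often cheaper than deleting 500K rows in place, since the surviving half is written once instead of the doomed half being deleted row by row.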

