
Here is my sql data

id      location1    location2      distance
--------------------------------------------
1       Paris        Marseille      150km
2       Paris        Lyon           200km
3       Paris        Strasbourg     300km
4       Paris        Toulouse       350km
5       Marseille    Paris          150km  <-(almost) duplicate of row 1
6       Marseille    Lyon           250km
...

Because the distance from Paris -> Marseille equals the distance from Marseille -> Paris, I want to remove one of the duplicated rows.

The table contains almost 1M rows, and half of them are duplicates. How can I remove these duplicates from such a large table?

  • Which one do you want to remove? Commented Nov 28, 2014 at 17:23
  • It doesn't matter. Whichever is easier to remove is fine. Commented Nov 28, 2014 at 17:25
  • Deleting 500K rows in a single query is very demanding for the server. Are you OK with using cursors? Commented Nov 28, 2014 at 17:29
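Following up on the comment about deleting 500K rows in one statement: a common alternative to cursors is to collect the duplicate ids first and delete them in small batches, each in its own transaction. Here is a minimal sketch using Python's sqlite3 as a stand-in database (the table name `city` and the batch size are assumptions, not from the question):

```python
import sqlite3

# In-memory stand-in for the real table; data taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    location1 TEXT NOT NULL,
    location2 TEXT NOT NULL,
    distance INTEGER NOT NULL)""")
conn.executemany("INSERT INTO city VALUES (?,?,?,?)", [
    (1, 'Paris', 'Marseille', 150),
    (2, 'Paris', 'Lyon', 200),
    (3, 'Paris', 'Strasbourg', 300),
    (4, 'Paris', 'Toulouse', 350),
    (5, 'Marseille', 'Paris', 150),
    (6, 'Marseille', 'Lyon', 250),
])

# Self-join to find the reversed duplicates; keep the lower id,
# collect the higher one for deletion.
dup_ids = [row[0] for row in conn.execute("""
    SELECT c2.id
      FROM city c1
      JOIN city c2
        ON c2.location1 = c1.location2
       AND c2.location2 = c1.location1
       AND c2.id > c1.id""")]

BATCH = 1000  # tune on the real 1M-row table
for i in range(0, len(dup_ids), BATCH):
    batch = dup_ids[i:i + BATCH]
    conn.execute(
        "DELETE FROM city WHERE id IN (%s)" % ",".join("?" * len(batch)),
        batch)
    conn.commit()  # commit per batch to keep transactions small

print(sorted(row[0] for row in conn.execute("SELECT id FROM city")))
# → [1, 2, 3, 4, 6]
```

Committing per batch keeps each transaction (and its lock footprint) small, which is the usual reason to avoid a single 500K-row delete.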

4 Answers


This is a situation where you can join the table with itself:

DELETE FROM city WHERE id IN (
  SELECT c1.id
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id)

I assumed your table is named city.

As noted by miszyman, it is more efficient to avoid a subquery:

  DELETE c1
  FROM city c1, city c2
  WHERE c1.location1 = c2.location2 AND c2.location1 = c1.location2
  AND c1.id < c2.id

2 Comments

You could also easily transform this into a delete.
You could also do it without the sub-query. If the table is big, execution with a sub-query is a lot slower, as the sub-query is executed for each row.

If every distance appears twice in your database, you can achieve this easily by just selecting the rows where location1 < location2.
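A minimal sketch of this idea, using Python's sqlite3 as a stand-in database (the table name `city` is an assumption). It only works under the answer's premise that every pair is stored in both directions; a one-directional row with location1 > location2, or a row with location1 = location2, would be deleted with nothing kept in its place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    location1 TEXT, location2 TEXT, distance INTEGER)""")
# Every pair stored in both directions, as the answer assumes.
conn.executemany("INSERT INTO city VALUES (?,?,?,?)", [
    (1, 'Paris', 'Marseille', 150),
    (5, 'Marseille', 'Paris', 150),
    (2, 'Paris', 'Lyon', 200),
    (7, 'Lyon', 'Paris', 200),
])

# Keep only the direction where location1 sorts before location2:
# exactly one row per pair survives, no self-join needed.
conn.execute("DELETE FROM city WHERE location1 >= location2")
conn.commit()

rows = list(conn.execute(
    "SELECT location1, location2 FROM city ORDER BY id"))
print(rows)
# → [('Marseille', 'Paris'), ('Lyon', 'Paris')]
```

The appeal of this form is that the delete needs no join at all, only a per-row comparison of the two columns.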


DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(id      INT NOT NULL
,location1    varchar(20) not null
,location2      varchar(20) not null
,distance INT NOT NULL
,UNIQUE(location1,location2)
);

INSERT INTO my_table VALUES
(1 ,'Paris','Marseille',150),
(2 ,'Paris','Lyon',200),
(3 ,'Paris','Strasbourg',300),
(4 ,'Paris','Toulouse',350),
(5 ,'Marseille','Paris',150),
(6 ,'Marseille','Lyon',250);

DELETE x 
  FROM my_table x 
  JOIN my_table y 
    ON y.location2 = x.location1 
   AND y.location1 = x.location2 
   AND y.distance = x.distance 
   AND y.id < x.id;
Query OK, 1 row affected (0.00 sec)

SELECT * 
  FROM my_table;
+----+-----------+------------+----------+
| id | location1 | location2  | distance |
+----+-----------+------------+----------+
|  6 | Marseille | Lyon       |      250 |
|  2 | Paris     | Lyon       |      200 |
|  1 | Paris     | Marseille  |      150 |
|  3 | Paris     | Strasbourg |      300 |
|  4 | Paris     | Toulouse   |      350 |
+----+-----------+------------+----------+



If half (or nearly half) of the rows are duplicates, I would go with creating a temporary table and re-inserting the data:

create temporary table tempt as
    select location1, location2, distance
    from mydata t
    where location1 < location2
    union all
    select location1, location2, distance
    from mydata t
    where location1 > location2 and
          not exists (select 1 from mydata t2 where t2.location1 = t.location2 and t2.location2 = t.location1);

truncate table mydata;

insert into mydata(location1, location2, distance)
    select location1, location2, distance
    from tempt;

For performance, you want an index on mydata(location1, location2):

create index idx_mydata_location1_location2 on mydata(location1, location2)
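A runnable sketch of this rebuild, using Python's sqlite3 (table and column names taken from the answer; SQLite has no TRUNCATE, so DELETE FROM stands in for it). The second UNION branch keeps one-directional rows whose reversed pair is missing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mydata (location1 TEXT, location2 TEXT, distance INTEGER);
INSERT INTO mydata VALUES
  ('Paris', 'Marseille', 150),
  ('Marseille', 'Paris', 150),   -- reversed duplicate: drop one copy
  ('Marseille', 'Lyon', 250);    -- one-directional: must survive

-- Keep rows with location1 < location2, plus rows stored only in the
-- "wrong" direction (no reversed counterpart exists).
CREATE TEMPORARY TABLE tempt AS
    SELECT location1, location2, distance
    FROM mydata t
    WHERE location1 < location2
    UNION ALL
    SELECT location1, location2, distance
    FROM mydata t
    WHERE location1 > location2
      AND NOT EXISTS (SELECT 1 FROM mydata t2
                      WHERE t2.location1 = t.location2
                        AND t2.location2 = t.location1);

DELETE FROM mydata;  -- stand-in for TRUNCATE TABLE mydata

INSERT INTO mydata SELECT * FROM tempt;
""")

rows = sorted(conn.execute("SELECT location1, location2 FROM mydata"))
print(rows)
# → [('Marseille', 'Lyon'), ('Marseille', 'Paris')]
```

Rebuilding into a fresh table is often cheaper than deleting 500K rows in place, since the surviving half is written once instead of the doomed half being deleted row by row.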

