0

I have a table > 500,000 rows where there are duplicate entries that need to be removed but only where a few columns match.

The main table has the following columns

id,
countryID,
postalCode,
adminName1,
adminName2,
placeName,
adminName3,
latitude,
longitude

I need to remove duplicates (leaving the first record) where placeName, latitude & longitude match

I had a search and found this which looks right but doesnt work for me. I have duplicated the original table structure into a new table (tblTemp)

INSERT INTO tblTemp(id,countryID,postalCode,adminName1,adminName2,placeName,adminName3,latitude,longitude)
SELECT DISTINCT placeName,latitude,longitude
FROM tblCountry_admin;

But i get the error

Column count doesn't match value count at row 1

2
  • 2
    Ofcourse it will. You are trying to insert 3 values into 9 columns. Commented Feb 11, 2017 at 17:16
  • Please specify which row to consider in case there are multiple rows with same placeName,latitude,longitude ... Maybe max id? Commented Feb 11, 2017 at 17:20

3 Answers 3

1

if you want delete the duplicate row by laceName, latitude & longitude leaving the the old one you could

You could check for select the duplicated rows this wat

  select * from tblCountry_admin
  where id not in (
     select min(id)
     from tblCountry_admin
     group by placename, latitude, longitude
     )

then you could delete this way

delete from tblCountry_admin
where id not in (
   select min(id)
   from tblCountry_admin
   group by placename, latitude, longitude
   )

the error you get in you insert select id due by the fact the number of column in insert don't match the number of column is select

Sign up to request clarification or add additional context in comments.

Comments

0

Use WHERE:

INSERT INTO tblTemp (id, countryID, postalCode, adminName1,adminName2,     
                     placeName, adminName3, latitude, longitude)
    SELECT id, countryID, postalCode, adminName1, adminName2,
           placeName, adminName3, latitude, longitude
    FROM tblCountry_admin a
    WHERE a.id = (SELECT MIN(a2.id)
                  FROM tblCountry_admin a2
                  WHERE a2.placeName = a.placeName AND
                        a2.latitude = a.latitude AND
                        a2.longitude = a.longitude
                 );

Comments

0

Assuming the tblTemp has same set of columns as tblCountry_admin and you want to get the rows with max id in case of duplicates, you can use this:

INSERT INTO tblTemp
select a.*
from tblCountry_admin a left join tblCountry_admin b on a.placeName = b.placeName
    and a.latitude = b.latitude
    and a.longitude = b.longitude
    and a.id < b.id
where b.id is null;

If you want to create the table using the select use:

create table tblTemp as
select a.*
from tblCountry_admin a left join tblCountry_admin b on a.placeName = b.placeName
    and a.latitude = b.latitude
    and a.longitude = b.longitude
    and a.id < b.id
where b.id is null;

2 Comments

This option is taking a vlong time to run and if i interrupt it no entries have been added to tblTemp
Working on half a million records is going to take time. It'll however be faster than group by operation provided proper indexes were in place in your original table.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.