4

I have a table which is as follows:

emp_name   emp_address  sex  matial_status  
uuuu       eee          m    s
iiii       iii          f    s
uuuu       eee          m    s

I want to remove the duplicate entries based on 3 fields emp_name, emp_address and sex. and my resultant table (after removing the duplicates) should look like -

emp_name    emp_address   sex   marital_status
uuuu        eee           m     s
iiii        iii           f     s

I am not able to recall how to write a SQL Query for this. an anyone pls help?

1
  • 3
    If you're not going to base duplication on all the columns of the row, then when a duplicate is found, how will you decide which row to keep? Commented Oct 6, 2011 at 14:52

8 Answers 8

5

I would create a new table with a unique index over the columns that you want to keep unique. Then do an insert from the old table into the new, ignoring the warnings about duplicated rows. Lastly, I would drop (or rename) the old table and replace it with the new table. In MySQL, this would look like

CREATE TABLE tmp LIKE mytable;
ALTER TABLE tmp ADD UNIQUE INDEX myindex (emp_name, emp_address, sex, marital_status);
INSERT IGNORE INTO tmp SELECT * FROM mytable;
DROP TABLE mytable;
RENAME TABLE tmp TO mytable;

Or something similar (this is totally untested).

Sign up to request clarification or add additional context in comments.

Comments

4

This is not a query but a delete statement. It will delete/remove duplicate rows from your table

;with C as
(
  select row_number() over(partition by DUPLICATE_VAARS_DECISION 
                           order by NODE_EQ_NO) as rn
  from yourtable
)
delete C
where rn > 1

If you are only interested in querying the table and get the non duplicates as a result you should use this instead.

;with C as
(
  select *,
         row_number() over(partition by DUPLICATE_VAARS_DECISION 
                           order by NODE_EQ_NO) as rn
  from yourtable
)
select *
from C
where rn = 1

1 Comment

Thanks this works! For the first statement to delete duplicates, it's more understandable like this: ;with C as ( select row_number() over(partition by Description order by Description) as rn from [YourTable] ) delete C where rn > 1
2

It looks like all four column values are duplicated so you can do this -

select distinct emp_name, emp_address, sex, marital_status
from YourTable

However if marital status can be different and you have some other column based on which to choose (for eg you want latest record based on a column create_date) you can do this

select emp_name, emp_address, sex, marital_status
from YourTable a
where not exists (select 1 
                   from YourTable b
                  where b.emp_name = a.emp_name and
                        b.emp_address = a.emp_address and
                        b.sex = a.sex and
                        b.create_date >= a.create_date)

1 Comment

This doesnt answer his question imo. He wants a UPDATE or DELETE FROM statement, not a single SELECT statement that is not permanent and does not alter the table in any way.
2

one way

select emp_name,   emp_address,  sex,  max(marital_status) as marital_status
from Yourtable
group by emp_name,   emp_address,  sex

Since I don't know what you want, I used max for the marital status

See also Including an Aggregated Column's Related Values for more examples

2 Comments

why you have used max function??
See @Ralph's comment on your question. What is your logic for determining which duplicate marital_status to keep?
0

If you are okay with trading space for performance and simplicity then the duplicates in emp_name | emp_address | sex combo can be eliminated, by the introduction of a calculated/derived column using CHECKSUM() TSQL method and DISTINCT keyword while querying.

Heres an example of CHECKSUM :

SELECT CHECKSUM(*) FROM HumanResources.Employee WHERE EmployeeID = 2

Google around and create a dependent column that contains the checksum of the 3 columns. Then you can select distinct rows by looking at this question

1 Comment

I also invite some critiques on this answer - I need to know if this is good enough (even for a table with 800k rows)
0

The best answer is here:
Use this SQL statement to identify the extra duplicated rows:

 select * from Employee a 
where %%physloc%% >
(select min(%%physloc%%) from Employee b
where a.emp_name=b.emp_name and a.emp_address=b.emp_address and a.sex=b.sex);

you will get the extra row:

uuuu   eee m   s 


Use this SQL statement to delete the extra duplicated rows:

 delete from Employee a 
where %%physloc%% >
(select min(%%physloc%%) from Employee b
where a.emp_name=b.emp_name and a.emp_address=b.emp_address and a.sex=b.sex);


For all duplicated records, only the one with lowest physical location is kept. This method can be applied to remove all kinds of duplicated rows.

I am assuming that you use MS SQL Server. If you are using Oracle DB, then you can just replace '%%physloc%%' with 'rowid'

Enjoy the code!

Comments

0

I know this is old post, but recently I tested a solution and want to share if any one can find my solution helpful -

CREATE TABLE tmpTable LIKE yourTable; insert into tmpTable (col1, col2 ... colN) SELECT distinct col1, col2 ... colN FROM yourTable WHERE 1; drop table yourTable; RENAME TABLE tmpTable TO yourTable;

Please note, insert into statement may execute without primary key.

Thanks.

Comments

0

If you don't satisfied with distinct try below

SELECT MAX(ID) AS MaxRecordID, max(FirstName) AS fname
    FROM [SampleDB].[dbo].[Employee]
    GROUP BY [FirstName], 
             [LastName], 
             [Country]

Use the Max key word with groupBy. You can use max for any type column. Integer, Varchar and etc.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.