I have a Windows service that watches a folder for CSV files. Each record in a CSV file is inserted into a SQL table. If the same CSV file is dropped into the folder again, it can lead to duplicate records in the table. How can I avoid duplicate insertions into the SQL table?
-
How do you detect a "duplicate" record? Is there a unique column, like a GUID, that you can compare the old and new records with, or will you need to check that every column is the same? Also, do you need to check whether each row in the CSV is a duplicate, or just avoid importing the same file twice? — Scott Chamberlain, Jun 26, 2013
-
Do you have any validation performed on the CSV file before upload? — Kaizen Programmer, Jun 26, 2013
-
Scott, unfortunately there is no primary key in the table. How do I compare a CSV record with an entire row of the SQL table? — blue piranha, Jun 26, 2013
-
Can legitimate duplicates exist? I.e., the same row arriving from different source files. — Hart CO, Jun 26, 2013
-
Goat_CO, no they can't. — blue piranha, Jun 26, 2013
3 Answers
The accepted answer has a syntax error and is not compatible with relational databases such as MySQL.
Specifically, the following form is not accepted by most databases:
values (...) where not exists
while the following is generic SQL and is compatible with nearly every database (Oracle additionally requires a FROM dual clause):
select ... where not exists
Given that, if you want to insert a single record into a table only when it does not already exist, you can attach a SELECT with a WHERE NOT EXISTS clause to your INSERT statement, like this:
INSERT
INTO table_name (
    primary_col,
    col_1,
    col_2
)
SELECT 1234,
       'val_1',
       'val_2'
WHERE NOT EXISTS (
    SELECT 1
    FROM table_name
    WHERE primary_col = 1234
);
Simply supply all values via the SELECT, and put the primary-key or unique-key condition in the WHERE NOT EXISTS subquery.
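The pattern above can be tried end-to-end with an in-memory SQLite database; the table and column names here are the illustrative ones from the answer, not from the asker's actual schema:

```python
import sqlite3

# Minimal sketch of the WHERE NOT EXISTS pattern, using SQLite in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (primary_col INTEGER, col_1 TEXT, col_2 TEXT)")

insert_sql = """
INSERT INTO table_name (primary_col, col_1, col_2)
SELECT 1234, 'val_1', 'val_2'
WHERE NOT EXISTS (SELECT 1 FROM table_name WHERE primary_col = 1234)
"""
conn.execute(insert_sql)  # first run inserts the row
conn.execute(insert_sql)  # second run inserts nothing: the key already exists

count = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]
print(count)  # 1
```

Running the same statement twice leaves a single row, which is exactly the re-imported-file scenario from the question.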
Problems with the answers using WHERE NOT EXISTS are:
- Performance: row-by-row processing potentially requires a very large number of table scans against the target table.
- NULL handling: for every column that might contain NULLs, you have to write the matching condition in a more complicated way, like (a = @a OR (a IS NULL AND @a IS NULL)). Repeat that for 10 columns and voilà, you hate SQL :)
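The NULL-handling pitfall is easy to reproduce. In this sketch (illustrative table `t` with nullable column `b`, not from the original answers), a plain equality check fails to see an existing row whose column is NULL, while the expanded predicate catches it:

```python
import sqlite3

# Why NULLs break a naive duplicate check: '=' never matches NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.execute("INSERT INTO t VALUES ('x', NULL)")

# Naive predicate: b = ? evaluates to NULL for NULL, so no match is found.
naive = conn.execute(
    "SELECT COUNT(*) FROM t WHERE a = ? AND b = ?",
    ("x", None)).fetchone()[0]

# NULL-safe predicate: one extra clause per nullable column.
safe = conn.execute(
    "SELECT COUNT(*) FROM t WHERE a = ? AND (b = ? OR (b IS NULL AND ? IS NULL))",
    ("x", None, None)).fetchone()[0]

print(naive, safe)  # 0 1
```

The naive check reports no duplicate and would let the row be inserted again; the NULL-safe version correctly finds the match, at the cost of that verbose clause per column.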
A better answer would take advantage of the great set-processing capabilities that relational databases provide (in short: never use row-by-row processing in SQL if you can avoid it; if you can't, think again and avoid it anyway).
So for the answer:
- load (all) data into a temporary table (or a staging table that can be safely truncated before load)
- run the insert in a "set"-way:
INSERT INTO table (<columns>)
SELECT <columns> FROM #temptab
EXCEPT
SELECT <columns> FROM table
Keep in mind that EXCEPT safely handles NULLs for every kind of column ;) and lets the optimizer choose a high-performance join type for the matching (hash, loop, or merge join) depending on the available indexes and table statistics.
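The staging-table approach can be sketched with SQLite as well; the `target` and `staging` table names are illustrative (`#temptab` above is SQL Server syntax for a temporary table). Note how EXCEPT treats NULLs as equal, so the re-imported row with a NULL column is filtered out without any per-column predicate:

```python
import sqlite3

# Set-based load: stage all CSV rows, then insert only rows not already
# present in the target table, using EXCEPT for whole-row comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (a TEXT, b TEXT)")
conn.execute("CREATE TABLE staging (a TEXT, b TEXT)")

conn.execute("INSERT INTO target VALUES ('x', NULL)")
# Staging holds the whole file, including one row already loaded (with a NULL).
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("x", None), ("y", "new")])

# EXCEPT compares entire rows and considers NULL equal to NULL,
# unlike the '=' operator in a WHERE NOT EXISTS predicate.
conn.execute("""
    INSERT INTO target (a, b)
    SELECT a, b FROM staging
    EXCEPT
    SELECT a, b FROM target
""")

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count)  # 2
```

Only the genuinely new row is inserted; the duplicate containing a NULL is skipped, which is exactly the failure mode the WHERE NOT EXISTS answers need extra clauses for.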