0

I have a windows service which basically watches a folder for any CSV file. Each record in the CSV file is inserted into a SQL table. If the same CSV file is put in that folder, it can lead to duplicate record entries in the table. How can I avoid duplicate insertions into the SQL table?

5
  • 1
    How do you detect a "duplicate" record, is there a unique column like a GUID you can compare the old and the new with or will you need to check that every column is the same. Also do you need to check if each row in the csv is a duplicate or just not import the same file twice? Commented Jun 26, 2013 at 18:41
  • Do you have any validation performed on the CSV file before upload? Commented Jun 26, 2013 at 18:42
  • Scott, unfortunately there is no primary key in the table. How do I compare a CSV record with the entire row of SQL table? Commented Jun 26, 2013 at 18:44
  • Can legitimate duplicates exist? Ie. same row from different source files. Commented Jun 26, 2013 at 18:47
  • Goat_CO, No they can't Commented Jun 26, 2013 at 18:49

3 Answers 3

1

Try INSERT WHERE NOT EXISTS, where a, b and c are relevant columns, @a, @b and @c are relevant values.

INSERT INTO table
(
    a,
    b,
    c
)
VALUES
(
    @a,
    @b,
    @c
)
WHERE NOT EXISTS
(
    SELECT 0 FROM table WHERE a = @a, b = @b, c = @c
)
Sign up to request clarification or add additional context in comments.

Comments

1

The accepted answer has a syntax error and is not compatible with relational databases like MySQL.

Specifically, the following is not compatible with most databases:

values(...) where not exists

While the following is generic SQL, and is compatible with all databases:

select ... where not exists

Given that, if you want to insert a single record into a table after checking if it already exists, you can do a simple select with a where not exists clause as part of your insert statement, like this:

INSERT 
INTO      table_name (
            primay_col, 
            col_1, 
            col_2
          )
SELECT    1234,
          'val_1',
          'val_2' 
WHERE     NOT EXISTS (
  SELECT  1 
  FROM    table_name 
  WHERE   primary_col=1234
);

Simply pass all values with the select keyword, and put the primary or unique key condition in the where clause.

Comments

0

Problems with the answers using WHERE NOT EXISTS are:

  1. performance -- row-by-row processing requires, potentially, a very large number of table scans against table
  2. NULL handling -- for every column where there might be NULLs you will have to write the matching condition in a more complicated way, like (a = @a OR (a IS NULL AND @a IS NULL)). Repeat that for 10 columns and viola - you hate SQL :)

A better answer would take into account the great SET processing capabilities that relational databases provide (in short -- never use row-by-row processing in SQL if you can avoid it. If you can't -- think again and avoid it anyway).

So for the answer:

  1. load (all) data into a temporary table (or a staging table that can be safely truncated before load)
  2. run the insert in a "set"-way:
INSERT INTO table (<columns>)
    select <columns> from #temptab
    EXCEPT
    select <columns> from table

Keep in mind that the EXCEPT is safely dealing with NULLs for every kind of column ;) as well as choosing a high-performance join type for matching (hash, loop, merge join) depending on the available indexes and table statistics.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.