I have a Windows service that watches a folder for CSV files. Each record in a CSV file is inserted into a SQL table. If the same CSV file is dropped into the folder again, it can lead to duplicate records in the table. How can I avoid duplicate insertions into the SQL table?
-
How do you detect a "duplicate" record? Is there a unique column, like a GUID, that you can compare the old and new records with, or will you need to check that every column is the same? Also, do you need to check whether each row in the CSV is a duplicate, or just avoid importing the same file twice? — Scott Chamberlain, Jun 26, 2013
-
Do you have any validation performed on the CSV file before upload? — Kaizen Programmer, Jun 26, 2013
-
Scott, unfortunately there is no primary key in the table. How do I compare a CSV record with an entire row of the SQL table? — blue piranha, Jun 26, 2013
-
Can legitimate duplicates exist? I.e., the same row arriving from different source files. — Hart CO, Jun 26, 2013
-
Goat_CO, no they can't. — blue piranha, Jun 26, 2013
3 Answers
The accepted answer has a syntax error and is not compatible with relational databases such as MySQL.
Specifically, the following form is not accepted by most databases:
values (...) where not exists
while the following is generic SQL and is compatible with nearly every database (Oracle additionally requires a FROM dual clause):
select ... where not exists
Given that, if you want to insert a single record into a table only when it does not already exist, you can attach a SELECT with a WHERE NOT EXISTS clause to your INSERT statement, like this:
INSERT
INTO table_name (
    primary_col,
    col_1,
    col_2
)
SELECT 1234,
       'val_1',
       'val_2'
WHERE NOT EXISTS (
    SELECT 1
    FROM table_name
    WHERE primary_col = 1234
);
Simply supply all values via the SELECT, and put the primary-key or unique-key condition in the WHERE NOT EXISTS subquery.
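The pattern above can be tried end-to-end with an in-memory SQLite database; the table and column names here are the illustrative ones from the answer, not from the asker's actual schema:

```python
import sqlite3

# Minimal sketch of the WHERE NOT EXISTS pattern, using SQLite in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (primary_col INTEGER, col_1 TEXT, col_2 TEXT)")

insert_sql = """
INSERT INTO table_name (primary_col, col_1, col_2)
SELECT 1234, 'val_1', 'val_2'
WHERE NOT EXISTS (SELECT 1 FROM table_name WHERE primary_col = 1234)
"""
conn.execute(insert_sql)  # first run inserts the row
conn.execute(insert_sql)  # second run inserts nothing: the key already exists

count = conn.execute("SELECT COUNT(*) FROM table_name").fetchone()[0]
print(count)  # 1
```

Running the same statement twice leaves a single row, which is exactly the re-imported-file scenario from the question.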
Problems with the answers using WHERE NOT EXISTS are:
- Performance: row-by-row processing potentially requires a very large number of table scans against the target table.
- NULL handling: for every column that might contain NULLs, you have to write the matching condition in a more complicated way, like (a = @a OR (a IS NULL AND @a IS NULL)). Repeat that for 10 columns and voilà, you hate SQL :)
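The NULL-handling pitfall is easy to reproduce. In this sketch (illustrative table `t` with nullable column `b`, not from the original answers), a plain equality check fails to see an existing row whose column is NULL, while the expanded predicate catches it:

```python
import sqlite3

# Why NULLs break a naive duplicate check: '=' never matches NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.execute("INSERT INTO t VALUES ('x', NULL)")

# Naive predicate: b = ? evaluates to NULL for NULL, so no match is found.
naive = conn.execute(
    "SELECT COUNT(*) FROM t WHERE a = ? AND b = ?",
    ("x", None)).fetchone()[0]

# NULL-safe predicate: one extra clause per nullable column.
safe = conn.execute(
    "SELECT COUNT(*) FROM t WHERE a = ? AND (b = ? OR (b IS NULL AND ? IS NULL))",
    ("x", None, None)).fetchone()[0]

print(naive, safe)  # 0 1
```

The naive check reports no duplicate and would let the row be inserted again; the NULL-safe version correctly finds the match, at the cost of that verbose clause per column.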
A better answer would take advantage of the great set-processing capabilities that relational databases provide (in short: never use row-by-row processing in SQL if you can avoid it; if you can't, think again and avoid it anyway).
So for the answer:
- load (all) data into a temporary table (or a staging table that can be safely truncated before load)
- run the insert in a "set"-way:
INSERT INTO table (<columns>)
SELECT <columns> FROM #temptab
EXCEPT
SELECT <columns> FROM table
Keep in mind that EXCEPT safely handles NULLs for every kind of column ;) and lets the optimizer choose a high-performance join type for the matching (hash, loop, or merge join) depending on the available indexes and table statistics.
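The staging-table approach can be sketched with SQLite as well; the `target` and `staging` table names are illustrative (`#temptab` above is SQL Server syntax for a temporary table). Note how EXCEPT treats NULLs as equal, so the re-imported row with a NULL column is filtered out without any per-column predicate:

```python
import sqlite3

# Set-based load: stage all CSV rows, then insert only rows not already
# present in the target table, using EXCEPT for whole-row comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (a TEXT, b TEXT)")
conn.execute("CREATE TABLE staging (a TEXT, b TEXT)")

conn.execute("INSERT INTO target VALUES ('x', NULL)")
# Staging holds the whole file, including one row already loaded (with a NULL).
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [("x", None), ("y", "new")])

# EXCEPT compares entire rows and considers NULL equal to NULL,
# unlike the '=' operator in a WHERE NOT EXISTS predicate.
conn.execute("""
    INSERT INTO target (a, b)
    SELECT a, b FROM staging
    EXCEPT
    SELECT a, b FROM target
""")

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count)  # 2
```

Only the genuinely new row is inserted; the duplicate containing a NULL is skipped, which is exactly the failure mode the WHERE NOT EXISTS answers need extra clauses for.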