2

I'm having trouble finding duplicates in a database table with the following setup:

==========================================================================
| stock_id  | product_id  | store_id  | stock_qty  | updated_at          |
==========================================================================
| 9990      | 51          | 1         | 13         | 2014-10-25 16:30:01 |
| 9991      | 90          | 2         | 5          | 2014-10-25 16:30:01 |
| 9992      | 161         | 1         | 3          | 2014-10-25 16:30:01 |
| 9993      | 254         | 1         | 18         | 2014-10-25 16:30:01 |
| 9994      | 284         | 2         | 12         | 2014-10-25 16:30:01 |
| 9995      | 51          | 1         | 11         | 2014-10-25 17:30:02 |
| 9996      | 90          | 2         | 5          | 2014-10-25 17:30:02 |
| 9997      | 161         | 1         | 3          | 2014-10-25 17:30:02 |
| 9998      | 254         | 1         | 16         | 2014-10-25 17:30:02 |
| 9999      | 284         | 2         | 12         | 2014-10-25 17:30:02 |
==========================================================================

Stock updates are imported into this table every hour, I'm trying to find duplicate stock entries (any rows which have a matching product id and store id) so I can delete the oldest. The query below is my attempt, by comparing product ids and store ids on a join like this I can find one set of duplicates:

SELECT s.`stock_id`, s.`product_id`, s.`store_id`, s.`stock_qty`, s.`updated_at`
FROM `stock` s
INNER JOIN `stock` j ON s.`product_id`=j.`product_id` AND s.`store_id`=j.`store_id`
GROUP BY `stock_id`
HAVING COUNT(*) > 1
ORDER BY s.updated_at DESC, s.product_id ASC, s.store_id ASC, s.stock_id ASC;

While this query will work, it doesn't find ALL duplicates, only 1 set, which means if an import goes awry and isn't noticed until the morning, there's a possibility that we'll be left with tons of duplicate stock entries. My MySQL skills are sadly lacking and I'm at a complete loss about how to find and delete all duplicates in a fast, reliable manner.

Any help or ideas are welcome. Thanks

5
  • 1
    why aren't you setting them as a composite PK? Commented Nov 27, 2014 at 15:56
  • hi @jbutler483, great question, the system we're working in has an ORM that requires the existence of a single primary key. We really on the ORM in other ways for scaffolding the CRUD management via the same system. Commented Nov 27, 2014 at 16:04
  • Plus the database will keep the oldest data, which is not the required functionality :) Commented Nov 27, 2014 at 16:05
  • maybe a command like this describes? or this Commented Nov 27, 2014 at 16:11
  • @HerbageOnion I can't see why it matters that there's a single PK - as long as you can have a unique key constraint on (product_id,store_id). Then INSERT... ON DUPLICATE KEY UPDATE will solve the problem Commented Nov 27, 2014 at 17:06

3 Answers 3

2

You can use this query:

DELETE st FROM stock st, stock st2 
WHERE st.stock_id < st2.stock_id AND st.product_id = st2.product_id AND 
st.store_id = st2.store_id;

This query will delete older record having same product_id and store_id and will keep latest record.

Sign up to request clarification or add additional context in comments.

Comments

1

A self join on store_id, product_id and 'is older' in combination with DISTINCT should give you all rows where also a newer version exists:

> SHOW CREATE TABLE stock;
CREATE TABLE `stock` (
  `stock_id` int(11) NOT NULL,
  `product_id` int(11) DEFAULT NULL,
  `store_id` int(11) DEFAULT NULL,
  `stock_qty` int(11) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`stock_id`)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        2 |          2 |        2 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+


> SELECT DISTINCT s1.stock_id, s1.store_id, s1.product_id, s1.updated_at   
   FROM stock s1   JOIN stock s2     
     ON s1.store_id = s2.store_id    
    AND s1.product_id = s2.product_id    
    AND s1.updated_at < s2.updated_at;
+----------+----------+------------+---------------------+
| stock_id | store_id | product_id | updated_at          |
+----------+----------+------------+---------------------+
|        2 |        2 |          2 | 2001-01-01 12:00:00 |
+----------+----------+------------+---------------------+

> DELETE stock FROM stock 
               JOIN stock s2  ON stock.store_id = s2.store_id  
                             AND stock.product_id = s2.product_id 
                             AND stock.updated_at < s2.updated_at;
Query OK, 1 row affected (0.02 sec)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+

9 Comments

Which can be combined with DELETE by only selecting s1.stock_id and combining that query with DELETE as ` DELETE FROM stock WHERE stock_id IN (SELECT DISTINCT s1.stock_id FROM ...) `
Thanks for your help Hartmut, did you mean to have s1.store_id = s2.store_id instead of s1.stock_id = s2.stock_id? I've tried that and I don't get any results...
Just to be clear, there are no results but also no errors. Thanks for your help so far.
please try the query in its current (edited/fixed) form
I tried it again, I think the sq is another typo, so I used s1, still no results though
|
1

Or you can use a stored Procedure:

DELIMITER //
DROP PROCEDURE IF EXISTS removeDuplicates;


 CREATE PROCEDURE removeDuplicates(
   stockID INT
 )
 BEGIN


    DECLARE stockToKeep INT;
    DECLARE storeID INT;
    DECLARE productID INT;

 -- gets the store and product value
 SELECT DISTINCT store_id, product_id
 FROM stock
  WHERE stock_id = stockID  
  LIMIT 1
 INTO
  storeID, productID;

 SELECT stock_id
 FROM stock
  WHERE product_id = productID AND store_id = storeID  
  ORDER BY updated_at DESC
  LIMIT 1
 INTO
  stockToKeep;

    DELETE FROM stock 
    WHERE product_id = productID AND store_id = storeID 
    AND stock_id != stockToKeep;
END //
DELIMITER ;

And afterwards call it for every pair of the product id and store id via a cursor procedure: DELIMITER // CREATE PROCEDURE updateTable() BEGIN DECLARE done BOOLEAN DEFAULT FALSE; DECLARE stockID INT UNSIGNED; DECLARE cur CURSOR FOR SELECT DISTINCT stock_id FROM stock; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

  OPEN cur;

  testLoop: LOOP
    FETCH cur INTO stockID;
    IF done THEN
      LEAVE testLoop;
    END IF;
    CALL removeDuplicates(stockID);
  END LOOP testLoop;

  CLOSE cur;
END//
DELIMITER ;

And then just call the second procedure

CALL updateTable();

1 Comment

Hi thanks for the detailed answer, I have to be honest, this scares the crap out of me, I don't doubt you know your stuff and that's probably one hell of a solution but I need to be able to maintain the solution I implement and I don't think I could with that :( Thanks very much for your time, I'll be deconstructing your solution to learn more about stored procedures!! :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.