I have thousands of lines of duplicate data in PostgreSQL database. To find out which row are duplicated, I am using this code:
SELECT "Date" FROM stockdata
group by "Date"
having count("Date")>1
This has produced again thousands of lines of date column which have more then 1 entry. How can I remove the row with the date so that just 1 entry of the duplicated item remains.
P.S I cannot use a Primary Key when entering data.
Update
As per the comment. There is no primary key. Also the Date is unique thus there cannot be 2 or more of it. df look like this:
Date High Low Open Close Volume Adj Close
0 2017-04-03 893.489990 885.419983 888.000000 891.510010 3422300 891.510010
1 2017-04-04 908.539978 890.280029 891.500000 906.830017 4984700 906.830017
2 2017-04-05 923.719971 905.619995 910.820007 909.280029 7508400 909.280029
3 2017-04-06 917.190002 894.489990 913.799988 898.280029 6344100 898.280029
4 2017-04-07 900.090027 889.309998 899.650024 894.880005 3710900 894.880005
... ... ... ... ... ... ... ...
12595 2022-03-28 1097.880005 1053.599976 1065.099976 1091.839966 34168700 1091.839966
12596 2022-03-29 1114.770020 1073.109985 1107.989990 1099.569946 24538300 1099.569946
12597 2022-03-30 1113.949951 1084.000000 1091.170044 1093.989990 19955000 1093.989990
12598 2022-03-31 1103.140015 1076.640015 1094.569946 1077.599976 16265600 1077.599976
12599 2022-04-01 1094.750000 1066.640015 1081.150024 1076.352783 11449987 1076.352783
12600 rows × 7 columns
- The data is repeated a few times at places.
- However the rows with the same date with have the same data.
- This data is not a stock data (i am using it as a troubleshoot example) but from yokogawa datalogger. https://www.yokogawa.com/in/solutions/products-platforms/data-acquisition/data-logger/#Overview There are redundancies in the system and the earlier integrator had just dumped all the data on 1 database and thus if redundant logger comes online, the database has multiple entries. I need to remove it so we can actually use the data. I don't have access to their software.
Further Update:
Using this code as suggested in the comments:
delete from stockdata s
using
(SELECT "Date" , max(ctid) as max_ctid from stockdata group by "Date") t
where s.ctid<>t.max_ctid
and s."Date"=t."Date";
It was able to do the job but going forward, is this dangerous solution for production?
stockdataI would expect there to be many lines that share a date. 2) Are you saying there is no Primary Key? 3) Define what a duplicated item is? Add answers as update to question.stockdata? Add answers as update to question.UNIQUEconstraint on the date field in the table so the data can't be duplicated. 4) The query seems sound assuming the redundant logger is not actually send different data for a given date. It is better to keep the problem from happening then to do a clean up afterwards.