
I have thousands of lines of duplicate data in a PostgreSQL database. To find out which rows are duplicated, I am using this query:

SELECT "Date" FROM stockdata
GROUP BY "Date"
HAVING count("Date") > 1;

This again produced thousands of lines of dates that have more than one entry. How can I remove the rows sharing a date so that just one entry of each duplicated item remains?

P.S. I cannot use a primary key when entering the data.

Update

As per the comments: there is no primary key. Also, the Date should be unique, so there should not be two or more rows with the same date. The df looks like this:

    Date    High    Low     Open    Close   Volume  Adj Close
0   2017-04-03  893.489990  885.419983  888.000000  891.510010  3422300     891.510010
1   2017-04-04  908.539978  890.280029  891.500000  906.830017  4984700     906.830017
2   2017-04-05  923.719971  905.619995  910.820007  909.280029  7508400     909.280029
3   2017-04-06  917.190002  894.489990  913.799988  898.280029  6344100     898.280029
4   2017-04-07  900.090027  889.309998  899.650024  894.880005  3710900     894.880005
...     ...     ...     ...     ...     ...     ...     ...
12595   2022-03-28  1097.880005     1053.599976     1065.099976     1091.839966     34168700    1091.839966
12596   2022-03-29  1114.770020     1073.109985     1107.989990     1099.569946     24538300    1099.569946
12597   2022-03-30  1113.949951     1084.000000     1091.170044     1093.989990     19955000    1093.989990
12598   2022-03-31  1103.140015     1076.640015     1094.569946     1077.599976     16265600    1077.599976
12599   2022-04-01  1094.750000     1066.640015     1081.150024     1076.352783     11449987    1076.352783

12600 rows × 7 columns
  1. The data is repeated a few times in places.
  2. However, rows with the same date have the same data.
  3. This is not stock data (I am using it as a troubleshooting example) but data from a Yokogawa datalogger: https://www.yokogawa.com/in/solutions/products-platforms/data-acquisition/data-logger/#Overview There are redundancies in the system, and the earlier integrator just dumped all the data into one database, so whenever a redundant logger comes online the database gets multiple entries. I need to remove them so we can actually use the data. I don't have access to their software.
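
Before deleting anything, the duplicated rows can be inspected using Postgres's internal ctid system column (a sketch against the stockdata table above; since rows sharing a date are identical, any one ctid per date could be kept):

```sql
-- List every physical row whose "Date" occurs more than once.
-- ctid is a Postgres system column identifying the row's physical location.
SELECT ctid, "Date", "Close"
FROM stockdata
WHERE "Date" IN (
    SELECT "Date"
    FROM stockdata
    GROUP BY "Date"
    HAVING count(*) > 1
)
ORDER BY "Date", ctid;
```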

Further Update:

Using this code as suggested in the comments:

DELETE FROM stockdata s
USING
    (SELECT "Date", max(ctid) AS max_ctid
     FROM stockdata
     GROUP BY "Date") t
WHERE s.ctid <> t.max_ctid
  AND s."Date" = t."Date";

It was able to do the job, but going forward, is this a dangerous solution for production?
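
One way to reduce the risk in production is to run the same delete inside a transaction and verify the result before committing. A sketch, assuming psql against the same table:

```sql
BEGIN;

DELETE FROM stockdata s
USING
    (SELECT "Date", max(ctid) AS max_ctid
     FROM stockdata
     GROUP BY "Date") t
WHERE s.ctid <> t.max_ctid
  AND s."Date" = t."Date";

-- Verify that no duplicated dates remain before making the delete permanent.
SELECT "Date", count(*)
FROM stockdata
GROUP BY "Date"
HAVING count(*) > 1;

COMMIT;  -- or ROLLBACK; if the check above returns unexpected rows
```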

  • 1) Given the table name stockdata I would expect there to be many lines that share a date. 2) Are you saying there is no primary key? 3) Define what a duplicated item is. Add answers as an update to the question. Commented Apr 1, 2022 at 17:52
  • Obviously the date is not unique, at least in the database, or you would not be at this point. There is no stock or index symbol? Have the exact same 12600 rows been imported repeatedly? Or are the dates the same for each import and the rest of the data different? What is the table definition for stockdata? Add answers as an update to the question. Commented Apr 1, 2022 at 18:14
  • Could you please have a look at the possible solution I have posted? Please advise whether this is dangerous in a production environment, as I am just trying to remove rows with common dates and keep just one entry. Also, I have updated the question. Commented Apr 1, 2022 at 19:00
  • 1) When presenting a problem, provide data that matches what you are dealing with. 2) Is it really dumped into one database or one table? 3) Add a UNIQUE constraint on the date field in the table so the data can't be duplicated. 4) The query seems sound, assuming the redundant logger is not actually sending different data for a given date. It is better to keep the problem from happening than to do a clean-up afterwards. Commented Apr 1, 2022 at 19:16
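
The UNIQUE constraint suggested in the last comment could look like the sketch below (the constraint name and the example values are illustrative); once it is in place, the logger's inserts can skip dates that already exist:

```sql
-- Run after the duplicates have been removed, otherwise the ALTER fails.
ALTER TABLE stockdata
    ADD CONSTRAINT stockdata_date_unique UNIQUE ("Date");

-- Redundant loggers can then insert without creating duplicates.
INSERT INTO stockdata ("Date", "High", "Low", "Open", "Close", "Volume", "Adj Close")
VALUES ('2022-04-02', 1095.00, 1066.64, 1081.15, 1076.35, 11449987, 1076.35)
ON CONFLICT ("Date") DO NOTHING;
```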

1 Answer


This should do the trick:

 DELETE FROM
     stockdata a
         USING stockdata b
 WHERE
     a.id < b.id
     AND a."Date" = b."Date";

But be careful, this will immediately delete all duplicates. There is no way to restore them.
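
Since the question later clarifies that no id column exists, the same keep-one-row-per-date idea can be expressed with ctid and a window function instead. A sketch, assuming rows sharing a date are otherwise identical:

```sql
-- Number each row within its date group and delete all but the first.
DELETE FROM stockdata
WHERE ctid IN (
    SELECT ctid
    FROM (
        SELECT ctid,
               row_number() OVER (PARTITION BY "Date" ORDER BY ctid) AS rn
        FROM stockdata
    ) numbered
    WHERE rn > 1
);
```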


4 Comments

Long version: this answer assumes that all rows sharing a date are duplicates except one, and that an id column exists, which has not been established; even if it does exist, assuming that the sequence matters is yet another assumption. Short version: this answer is not safe and is dangerous.
Yes, you are right: it was not said that an id column would exist. The query in the question was grouped by the date, so it is pretty obvious what a duplicate would be. If there were more columns to check, you can just add them in the WHERE clause.
I don't have an id column, though!
That dataframe has a row number which can be used like a serial PK. Otherwise, you can look into finding the oid of the records; oid is internal to Postgres. I agree with Adrian: this is a dangerous solution. Good luck getting back your deletions if you delete more than you expect.
