
I query 4 hours of data from a source PLC MS SQL database, process it with Python, and write the data to a main PostgreSQL table.

Because each hourly write overlaps the previous 3 hours, the main PostgreSQL table would receive duplicate rows; these violate the primary key, abort the transaction, and raise a Python error.

So, every hour:

  1. I create a temp PostgreSQL table without any key
  2. Copy the pandas DataFrame into the temp table
  3. Insert the rows from the temp table into the main PostgreSQL table
  4. Drop the temp table

This Python script runs hourly via Windows Task Scheduler.

Below is my code.

import io

from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
conn = engine.raw_connection()
cur = conn.cursor()

# Create the keyless staging table, so COPY cannot fail on duplicates.
cur.execute("""CREATE TABLE public.table_temp
(
    datetime timestamp without time zone NOT NULL,
    tagid text COLLATE pg_catalog."default" NOT NULL,
    mc text COLLATE pg_catalog."default" NOT NULL,
    value text COLLATE pg_catalog."default",
    quality text COLLATE pg_catalog."default"
)

TABLESPACE pg_default;

ALTER TABLE public.table_temp
    OWNER to postgres;""")

# Stream the DataFrame into the staging table via COPY.
output = io.StringIO()
df.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, 'table_temp', null="")

# Move the rows into the main table, skipping any that already exist there.
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
cur.execute("""DROP TABLE table_temp CASCADE;""")
conn.commit()

I would like to know if there is a more efficient/faster way to do this.

  • Pandas DataFrames have a to_sql() method; you may not need to export to CSV. Commented Nov 10, 2021 at 4:56

1 Answer


If I'm correct in assuming that the data is in the DataFrame, you should just be able to do:

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates(subset=None)  # Replace None with the list of columns that define the primary key, e.g. ['column_name1', 'column_name2']
df.to_sql('table_main', engine, if_exists='append', index=False)
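Note that drop_duplicates only removes duplicates within the DataFrame itself; rows whose keys already exist in table_main would still make the append fail with an IntegrityError, which is the situation the edit below addresses.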

Edit due to comment:
If that's the case, you have the right idea. You can make it more efficient by using to_sql to insert the data into the temp table first, like so:

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df.to_sql('table_temp', engine, if_exists='replace', index=False)  # index=False keeps the DataFrame index out of the table so the columns line up with table_main
conn = engine.raw_connection()
cur = conn.cursor()
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
# cur.execute("""DROP TABLE table_temp CASCADE;""")  # You can drop if you want to, but the replace option in to_sql will drop and recreate the table
conn.commit()
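If the hourly batches are large, the to_sql step itself can also be sped up by supplying a COPY-based insertion method. The sketch below is adapted from the insertion-method recipe in the pandas documentation and assumes psycopg2 is the driver; psql_insert_copy is just an illustrative name.

import csv
import io

def psql_insert_copy(table, conn, keys, data_iter):
    # conn is a SQLAlchemy connection; .connection exposes the raw DBAPI connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        # COPY is considerably faster than row-by-row INSERTs for large batches
        cur.copy_expert(sql='COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), file=buf)

df.to_sql('table_temp', engine, if_exists='replace', index=False, method=psql_insert_copy)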

2 Comments

Suppose we query 4 hours of data every hour, say from 5:00 am to 9:00 am, and write it to the PostgreSQL main table at 9:05 am. The main table would already contain the 5:00 am to 8:00 am data (the previous 3 hours), saved at 7:05 am, 8:05 am, etc. I don't want to write the 5:00 am to 8:00 am data to the main table again and create duplicates. That is my intention. Also, you may ask why I fetch 4 hours of data every time: it is just for safety, because Task Scheduler sometimes skips a particular hourly run, which could otherwise mean losing that hour's data. Thanks!
Depending on how the data is retrieved, you can store the max timestamp of the data retrieved (maybe you can get it from table_main) and start from there. This would avoid grabbing data that you've already grabbed, and avoid missing any.
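To illustrate that watermark idea, a minimal sketch; the source table name plc_table and the connection mssql_conn are hypothetical stand-ins for however the PLC database is actually queried:

import pandas as pd

# Read the newest timestamp already stored in the main table (the watermark).
last_ts = pd.read_sql("SELECT max(datetime) AS last_ts FROM public.table_main",
                      engine)["last_ts"].iloc[0]

# Fetch only rows newer than the watermark from the source PLC database.
# mssql_conn and plc_table are placeholders for the real MS SQL connection and table.
df = pd.read_sql("SELECT datetime, tagid, mc, value, quality "
                 "FROM plc_table WHERE datetime > ?",
                 mssql_conn, params=[last_ts])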
