
I query 4 hours of data from a source PLC MS SQL database, process it with Python, and write the data to a main PostgreSQL table.

Because each hourly write overlaps the previous 3 hours, the main PostgreSQL table would receive duplicate rows; these violate the primary key, abort the transaction, and raise a Python error.

So, every hour:

  1. I create a temp PostgreSQL table without any key
  2. Copy the pandas DataFrame into the temp table
  3. Insert the rows from the temp table into the main PostgreSQL table
  4. Drop the temp table

This Python script runs hourly via Windows Task Scheduler.

Below is my code.

import io

from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
conn = engine.raw_connection()
cur = conn.cursor()

# Create the keyless staging table, so COPY cannot fail on duplicates.
cur.execute("""CREATE TABLE public.table_temp
(
    datetime timestamp without time zone NOT NULL,
    tagid text COLLATE pg_catalog."default" NOT NULL,
    mc text COLLATE pg_catalog."default" NOT NULL,
    value text COLLATE pg_catalog."default",
    quality text COLLATE pg_catalog."default"
)

TABLESPACE pg_default;

ALTER TABLE public.table_temp
    OWNER to postgres;""")

# Stream the DataFrame into the staging table via COPY.
output = io.StringIO()
df.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, 'table_temp', null="")

# Move the rows into the main table, skipping any that already exist there.
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
cur.execute("""DROP TABLE table_temp CASCADE;""")
conn.commit()

I would like to know if there is a more efficient/faster way to do this.

  • Pandas DataFrames have a to_sql() method; you may not need to export to CSV. Commented Nov 10, 2021 at 4:56

1 Answer


If I'm correct in assuming that the data is in the DataFrame, you should just be able to do:

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates(subset=None)  # Replace None with the list of columns that define the primary key, e.g. ['column_name1', 'column_name2']
df.to_sql('table_main', engine, if_exists='append', index=False)
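Note that drop_duplicates only removes duplicates within the DataFrame itself; rows whose keys already exist in table_main would still make the append fail with an IntegrityError, which is the situation the edit below addresses.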

Edit due to comment:
If that's the case, you have the right idea. You can make it more efficient by using to_sql to insert the data into the temp table first, like so:

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df.to_sql('table_temp', engine, if_exists='replace', index=False)  # index=False keeps the DataFrame index out of the table so the columns line up with table_main
conn = engine.raw_connection()
cur = conn.cursor()
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
# cur.execute("""DROP TABLE table_temp CASCADE;""")  # You can drop if you want to, but the replace option in to_sql will drop and recreate the table
conn.commit()
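If the hourly batches are large, the to_sql step itself can also be sped up by supplying a COPY-based insertion method. The sketch below is adapted from the insertion-method recipe in the pandas documentation and assumes psycopg2 is the driver; psql_insert_copy is just an illustrative name.

import csv
import io

def psql_insert_copy(table, conn, keys, data_iter):
    # conn is a SQLAlchemy connection; .connection exposes the raw DBAPI connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        # COPY is considerably faster than row-by-row INSERTs for large batches
        cur.copy_expert(sql='COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), file=buf)

df.to_sql('table_temp', engine, if_exists='replace', index=False, method=psql_insert_copy)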

2 Comments

Suppose we query 4 hours of data every hour, say from 5:00 am to 9:00 am, and write it to the PostgreSQL main table at 9:05 am. The main table would already contain the 5:00 am to 8:00 am data (the previous 3 hours), saved at 7:05 am, 8:05 am, etc. I don't want to write the 5:00 am to 8:00 am data to the main table again and create duplicates. That is my intention. Also, you may ask why I fetch 4 hours of data every time: it is just for safety, because Task Scheduler sometimes skips a particular hourly run, which could otherwise mean losing that hour's data. Thanks!
Depending on how the data is retrieved, you can store the max timestamp of the data retrieved (maybe you can get it from table_main) and start from there. This would avoid grabbing data that you've already grabbed, and avoid missing any.
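To illustrate that watermark idea, a minimal sketch; the source table name plc_table and the connection mssql_conn are hypothetical stand-ins for however the PLC database is actually queried:

import pandas as pd

# Read the newest timestamp already stored in the main table (the watermark).
last_ts = pd.read_sql("SELECT max(datetime) AS last_ts FROM public.table_main",
                      engine)["last_ts"].iloc[0]

# Fetch only rows newer than the watermark from the source PLC database.
# mssql_conn and plc_table are placeholders for the real MS SQL connection and table.
df = pd.read_sql("SELECT datetime, tagid, mc, value, quality "
                 "FROM plc_table WHERE datetime > ?",
                 mssql_conn, params=[last_ts])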
