slow update query with redshift from python 3 using psycopg2

Question

I'm using this code to update several records on Redshift (around 30.000 records per run).

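from psycopg2.extras import RealDictCursor

# conn is an open psycopg2 connection; df_ignored is a pandas DataFrame indexed by id
cur = conn.cursor(cursor_factory=RealDictCursor)
sql_string_update = """UPDATE my_table SET "outlier_reason" = {0} WHERE "id" = {1};"""
for id, row in df_ignored.iterrows():
    # one UPDATE statement (and one round trip) per row
    sql_ = sql_string_update.format(row['outlier_reason'], id)
    cur.execute(sql_)
conn.commit()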

Each run of around 30.000 records takes up to 2 hours.

Is there a way to speed up this query?

You are running 30.000 updates on the database; there is no way this can get any faster. My recommendation is to build this logic: 1. create a file in S3 containing the new rows, 2. delete the rows that need to be updated, 3. use COPY to load the data from S3 into Redshift. Let me know if you need more clarification.
– demircioglu, Commented Nov 28, 2018 at 18:58

Red Boy · Accepted Answer · 2018-11-28 19:00:08Z


I think that instead of touching the table and doing updates one by one, you should do this the ETL way; I believe that would be much faster and should take care of 30K records in a few minutes. Here is the approach.

1. Create a staging table, say stg_my_table (id, outlier_reason).
2. Write your Python program's data to a CSV or JSON file, whichever suits your case. Save it to S3 or EC2.
3. Use the COPY command to load the file into stg_my_table, including the ID.
4. Run a single UPDATE on my_table, joining it with stg_my_table on the ID, to set outlier_reason.
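The steps above can be sketched roughly as follows. This is a minimal illustration, not a tested Redshift deployment: the helper names (rows_to_csv, staging_statements, apply_updates), the staging column types, and the use of a temporary table are assumptions, and the S3 URI and IAM role must be supplied by you. The first, third and fourth steps become three SQL statements executed on one connection, with a single commit at the end.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (id, outlier_reason) pairs to CSV text for the staging load."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row_id, reason in rows:
        writer.writerow([row_id, reason])
    return buf.getvalue()


def staging_statements(s3_uri, iam_role):
    """Return the SQL for the staging approach, in execution order."""
    return [
        # Staging table; the column types are assumptions, match them to my_table.
        "CREATE TEMP TABLE stg_my_table (id BIGINT, outlier_reason VARCHAR(256));",
        # Bulk load of the CSV file that was uploaded to S3 beforehand.
        f"COPY stg_my_table FROM '{s3_uri}' IAM_ROLE '{iam_role}' FORMAT AS CSV;",
        # One set-based UPDATE joined on id, instead of one UPDATE per row.
        "UPDATE my_table SET outlier_reason = s.outlier_reason "
        "FROM stg_my_table s WHERE my_table.id = s.id;",
    ]


def apply_updates(conn, s3_uri, iam_role):
    """Run the statements on an open DB-API connection (e.g. psycopg2), commit once."""
    cur = conn.cursor()
    for statement in staging_statements(s3_uri, iam_role):
        cur.execute(statement)
    conn.commit()
```

With the question's DataFrame, the CSV text could be built with rows_to_csv((idx, row['outlier_reason']) for idx, row in df_ignored.iterrows()) and uploaded to S3 (for example with boto3's put_object) before calling apply_updates(conn, s3_uri, iam_role).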

I think the above solution should reduce the processing time from 2 hours to a few minutes. Please try it manually first, before writing the actual code. I'm sure you will see very promising results, and you can then optimize each of the above steps one by one to gain even more performance.


2 Comments

otmezger Over a year ago

Thanks for the answer. That sounds like a lot of work. I'll give it a try, pity there is no simpler solution.

Red Boy Over a year ago

@otmezger Redshift is not designed for very frequent updates, and certainly not for updates of individual records, as it is a columnar database.
