
I have to load millions of records into a Redshift database (this is a hard requirement). What would be the most efficient/fast way of doing this? Right now I'm building the data in chunks with pandas, storing each chunk in a dictionary as a string of row values so I can place it inside a query string, and then inserting it like this:

import psycopg2

with psycopg2.connect(prs.rs_conection_params_psycopg2) as conn:
    with conn.cursor() as c:
        # create the empty target table first
        c.execute(query_create_empty_main_table)

        # insert each pre-built chunk of rows with a plain INSERT ... VALUES
        for chunk in df_chunks.keys():

            query_to_insert_new_data = """
                INSERT INTO {}
                {}
                VALUES
                {};
                """.format(table_name, column_names, df_chunks[chunk])

            c.execute(query_to_insert_new_data)

        conn.commit()

The table is created from scratch every time, since the information it holds is dynamic.

Would it be appropriate to use PySpark (if possible) or the Parallel module? If so, how could it be done? Thanks, regards.

1 Answer


You have a few options; however, batching up inserts is not a good one!

My favorites:

  • Option 1 - Python -> S3 CSV -> Redshift using Redshift COPY command
  • Option 2 - Python -> S3 PARQUET -> Redshift using Redshift Spectrum

Your choice will depend upon the use case that you have in mind.
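For Option 1, a minimal sketch could look like the one below. It assumes you already have a pandas DataFrame `df`, an S3 bucket you can write to, and an IAM role attached to the cluster that can read from that bucket; the bucket name, key, role ARN and table name are placeholders, while `prs.rs_conection_params_psycopg2` and `query_create_empty_main_table` are reused from your question:

import boto3
import psycopg2

# 1. Dump the DataFrame to a local CSV (no header, so COPY maps columns by position)
df.to_csv("/tmp/my_table.csv", index=False, header=False)

# 2. Upload the file to S3 (bucket and key are placeholders)
boto3.client("s3").upload_file("/tmp/my_table.csv", "my-bucket", "staging/my_table.csv")

# 3. Have Redshift ingest the file with COPY (placeholder role ARN)
copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/staging/my_table.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS CSV;
"""

with psycopg2.connect(prs.rs_conection_params_psycopg2) as conn:
    with conn.cursor() as c:
        c.execute(query_create_empty_main_table)  # recreate the table as before
        c.execute(copy_sql)
    conn.commit()

COPY lets Redshift ingest the file in parallel across the cluster's slices, which is why a single COPY of one large file (or, better, several gzipped parts) is usually far faster than issuing many multi-row INSERTs. Option 2 works similarly, except that you write Parquet files to S3 and expose them through an external (Spectrum) schema instead of copying the data into the cluster.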
