
I have to load millions of records into a Redshift database (this is a hard requirement). What would be the most efficient/fast way of doing this? Right now I'm building the data in chunks with pandas, storing each chunk in a dictionary as a string of row values so I can place it inside a query string, and then inserting it like this:

import psycopg2

with psycopg2.connect(prs.rs_conection_params_psycopg2) as conn:
    with conn.cursor() as c:
        # create the empty target table first
        c.execute(query_create_empty_main_table)

        # insert each pre-built chunk of rows with a plain INSERT ... VALUES
        for chunk in df_chunks.keys():

            query_to_insert_new_data = """
                INSERT INTO {}
                {}
                VALUES
                {};
                """.format(table_name, column_names, df_chunks[chunk])

            c.execute(query_to_insert_new_data)

        conn.commit()

The table is created from scratch every time, since the information it holds is dynamic.

Would it be appropriate to use PySpark (if possible) or the Parallel module? If so, how could it be done? Thanks, regards.

1 Answer


You have a few options; however, batching up inserts is not a good one!

My favorites:

  • Option 1 - Python -> S3 CSV -> Redshift using Redshift COPY command
  • Option 2 - Python -> S3 PARQUET -> Redshift using Redshift Spectrum

Your choice will depend upon the use case that you have in mind.
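For Option 1, a minimal sketch could look like the one below. It assumes you already have a pandas DataFrame `df`, an S3 bucket you can write to, and an IAM role attached to the cluster that can read from that bucket; the bucket name, key, role ARN and table name are placeholders, while `prs.rs_conection_params_psycopg2` and `query_create_empty_main_table` are reused from your question:

import boto3
import psycopg2

# 1. Dump the DataFrame to a local CSV (no header, so COPY maps columns by position)
df.to_csv("/tmp/my_table.csv", index=False, header=False)

# 2. Upload the file to S3 (bucket and key are placeholders)
boto3.client("s3").upload_file("/tmp/my_table.csv", "my-bucket", "staging/my_table.csv")

# 3. Have Redshift ingest the file with COPY (placeholder role ARN)
copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/staging/my_table.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS CSV;
"""

with psycopg2.connect(prs.rs_conection_params_psycopg2) as conn:
    with conn.cursor() as c:
        c.execute(query_create_empty_main_table)  # recreate the table as before
        c.execute(copy_sql)
    conn.commit()

COPY lets Redshift ingest the file in parallel across the cluster's slices, which is why a single COPY of one large file (or, better, several gzipped parts) is usually far faster than issuing many multi-row INSERTs. Option 2 works similarly, except that you write Parquet files to S3 and expose them through an external (Spectrum) schema instead of copying the data into the cluster.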
