Loading data from AWS redshift using python

Question

I'm facing a mission impossible to extract a huge amount of data from Amazone Redshift to another table. It definitely requires a more efficient approach but I'm new to SQL and AWS so decided to ask this smart community for advice.

This is my initial SQL query which takes forever:

-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create' 
AND (dt BETWEEN '20181001' AND '20181031');

What would be the best approach? I was thinking of using python and sqlalchemy to create dataframes with chunks of 1m rows and inserting it back into the new table (which I need to create beforehand). Would this work?:

from sqlalchemy import create_engine
import os
import pandas as pd

redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']

engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
% (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)

for df in pd.read_sql_query("""
                        SELECT DISTINCT at_id, utc_time, name
                        INSERT INTO my_new_table
                        FROM s3_db.table_x
                        WHERE type = 'create' 
                        AND (dt BETWEEN '20181001' AND '20181031');
                       """, engine, chunksize=1000000):

you’re moving data from one redshift table to another redshift table? — gold_cy
– gold_cy, Commented Jan 12, 2019 at 13:40
database operation will ultimately be faster than going to pandas and then back to redshift, the problem is it takes to long right now? — gold_cy
– gold_cy, Commented Jan 12, 2019 at 14:42
Exactly - nearly impossible to complete as I get broken pipe time to time a loose everything. Is there a way doing it in chunks in SQL? — aviss
– aviss, Commented Jan 12, 2019 at 15:43
You need to take a more detailed look at how the source table is structured. Is it an actual Redshift table, or as the name "s3_db" seems to imply, is it a Spectrum table? If the latter, the data may exist as files in S3 rather than Redshift itself and you can potentially use other tools such as EMR/Spark, Athena or Glue to create the new data set. If it IS an actual Redshift table then check what the distribution key and sort key are for the source table. If you can apply a filter to the sort key this may help speed up the query. — Nathan Griffiths
– Nathan Griffiths, Commented Jan 12, 2019 at 20:03

John Rotenstein · Accepted Answer · 2019-01-12 15:01:55Z

3

You should use CREATE TABLE AS.

This allows you to specify a SELECT statement and have the results directly stored into a new table.

This is hugely more efficient than downloading data and re-uploading.

You can also CREATE TABLE LIKE and then load it with data. See: Performing a Deep Copy

You could also UNLOAD data to Amazon S3, then load it again via COPY, but using CREATE TABLE AS is definitely the best option.

answered Jan 12, 2019 at 15:01

John Rotenstein

273k28 gold badges456 silver badges541 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Jon Scott Over a year ago

John, this is good general logic but in this case i the the OP is trying to copy from s3 to redshift (we need clarification)

Manash Deb · Accepted Answer · 2019-01-14 03:46:13Z

Please refer AWS guidelines for RedShift and Spectrum best practices; I've put the links at the end of this post. Based on your question, I am assuming you want to extract, transform and load huge amount of data from RedShift Spectrum based table "s3_db.table_x" to new RedShift table "my_new_table"

Here are some suggestions based on AWS recommendations:

Create your RedShift table with appropriate distribution key, sort key and compression encoding. At high level, "at_id" seems best suited as partition key and "utc_time" as sortkey for your requirement, but make sure to refer AWS guidelines for RedShift table design 3.
As you mentioned, your data volume is huge, you may like to have your S3 source table "s3_db.table_x" partitioned based on "type" and "dt" columns (as suggested at point number 4 in spectrum best practices 1).
Replace DISTINCTwith GROUP BY in the select query from Spectrum (point number 9 in Spectrum Best Practices 1).
AWS recommends (point number 7 in Spectrum best practices 1) to simplify your ETL process using CREATE TABLE AS SELECT or SELECT INTO statements, wherein you may put your transformation logic in the select component to load data directly form S3 to RedShift.

redshift spectrum best practices

redshift best practices

redshift table design playbook

John Rotenstein · Accepted Answer · 2019-01-13 23:58:00Z

0

It now appears that your source data is stored in Amazon S3 and you have been using a Redshift Spectrum table (that points to data in S3) as your source.

The preferred method would be:

Use the Amazon Redshift COPY command to load the data into a Redshift table
Use a CREATE TABLE AS command to extract (ETL) the data from the new Redshift table into your desired table. If you do this on a regular basis, you can use TRUNCATE and INSERT INTO to reload the table in future.

answered Jan 13, 2019 at 23:58

John Rotenstein

273k28 gold badges456 silver badges541 bronze badges

Collectives™ on Stack Overflow

Loading data from AWS redshift using python

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related