I need to extract a huge amount of data from Amazon Redshift into another table, and my current approach is far too slow. It clearly requires something more efficient, but I'm new to SQL and AWS, so I decided to ask this community for advice.
This is my initial SQL query, which takes forever:
-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create'
AND (dt BETWEEN '20181001' AND '20181031');
What would be the best approach? I was thinking of using Python and SQLAlchemy to read the data into DataFrames in chunks of 1 million rows and insert each chunk back into the new table (which I would need to create beforehand). Would something like this work?
import os

import pandas as pd
from sqlalchemy import create_engine

redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']

# XXXX stands for the port number
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)

# Read the filtered rows in chunks of 1m and append each chunk to the new table
for df in pd.read_sql_query("""
SELECT DISTINCT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
AND (dt BETWEEN '20181001' AND '20181031');
""", engine, chunksize=1000000):
    df.to_sql('my_new_table', engine, if_exists='append', index=False)