
I am using AWS Glue, and you cannot read/write multiple dynamic frames without iterating. I wrote the code below but am struggling with two things:

  1. Is "tableName" i.e. the filtered list of tables correct (all the tables I want to iterate on start with client_historical_*).
  2. I am stuck on how to dynamically populate the Redshift table name using the mapping below.

Redshift mappings:

client_historical_ks --> table_01_a
client_historical_kg --> table_01_b
client_historical_kt --> table_01_c
client_historical_kf --> table_01_d

Code:

client = boto3.client('glue',region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']

for table in tableList:
    start_prefix = client_historical_
    tableName = list(filter(lambda x: x.startswith(start_prefix), table['Name']))
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = tableName, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": "nameoftablehere", "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
  • start_prefix = client_historical_ ... can you put this in quotes (start_prefix = 'client_historical_') and try? By the way, what is the result of this code: is it working or not? If not, what error are you getting? Please add more info.

1 Answer

You can create a mapping dictionary and then execute your code. You can also filter the tables outside of the loop and then loop over only the required tables.

mapping = {'client_historical_ks': 'table_01_a',
           'client_historical_kg': 'table_01_b',
           'client_historical_kt': 'table_01_c',
           'client_historical_kf': 'table_01_d'}

import boto3  # glueContext and args come from the standard Glue job setup

client = boto3.client('glue', region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName=databaseName)
tableList = tables['TableList']
start_prefix = 'client_historical_'
# get_tables returns a list of dicts; keep only the names with the prefix
tableNames = [t['Name'] for t in tableList if t['Name'].startswith(start_prefix)]

for table in tableNames:
    # Look up the Redshift target table for this catalog table
    # (returns None if a table isn't in the mapping)
    target_table = mapping.get(table)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = table, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": target_table, "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
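
One caveat: get_tables returns paginated results, so a single call may miss tables if the database has many. A minimal sketch using boto3's get_tables paginator, reusing the same client and variables as above:

paginator = client.get_paginator('get_tables')
tableNames = []
for page in paginator.paginate(DatabaseName=databaseName):
    # Each page carries its own TableList; accumulate matching names
    tableNames += [t['Name'] for t in page['TableList'] if t['Name'].startswith(start_prefix)]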


3 Comments

I am getting an error in Glue when I try this: raise ConnectTimeoutError(endpoint_url=request.url, error=e) botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "glue.us-east-1.amazonaws.com". Seems like boto3 isn't working as expected?
Boto3 doesn't work in the Spark shell of Glue jobs. Alternatively, you can create a Lambda that gets the table list and then calls the Glue job from that Lambda, passing the table list as a parameter (see the sketch after these comments).
I need to use Glue unfortunately because the Data Catalog needs to be set up. :( Is the alternative to make single scripts?
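
A minimal sketch of that Lambda approach, assuming a Glue job named load_client_historical that reads a --table_names argument; both names are hypothetical and would need to match your setup:

import json
import boto3

glue = boto3.client('glue', region_name='us-east-1')

def lambda_handler(event, context):
    # boto3 works here, so fetch and filter the catalog tables in the Lambda
    tables = glue.get_tables(DatabaseName='incomingdata')['TableList']
    names = [t['Name'] for t in tables if t['Name'].startswith('client_historical_')]

    # Start the Glue job, passing the table list as a job argument
    glue.start_job_run(
        JobName='load_client_historical',  # hypothetical job name
        Arguments={'--table_names': json.dumps(names)}
    )

Inside the Glue job, the list can then be recovered with getResolvedOptions instead of calling boto3:

import sys
import json
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['TempDir', 'table_names'])
tableNames = json.loads(args['table_names'])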
