
I am using AWS Glue, and you cannot read/write multiple dynamic frames without iterating. I wrote the code below but am struggling with two things:

  1. Is "tableName" i.e. the filtered list of tables correct (all the tables I want to iterate on start with client_historical_*).
  2. I am stuck on how to dynamically populate the Redshift table name using the mapping below.

Redshift mappings:

client_historical_ks --> table_01_a
client_historical_kg --> table_01_b
client_historical_kt --> table_01_c
client_historical_kf --> table_01_d

Code:

client = boto3.client('glue',region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables['TableList']

for table in tableList:
    start_prefix = client_historical_
    tableName = list(filter(lambda x: x.startswith(start_prefix), table['Name']))
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = tableName, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": "nameoftablehere", "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
  • start_prefix = client_historical_ ... can you put this in quotes (start_prefix = 'client_historical_') and try? By the way, what is the result of this code: is it working or not? If not, what error are you getting? Please add more info.

1 Answer

You can create a mapping dictionary and then execute your code. You can also filter the tables outside of the loop and then loop over only the required tables.

mapping = {'client_historical_ks': 'table_01_a',
           'client_historical_kg': 'table_01_b',
           'client_historical_kt': 'table_01_c',
           'client_historical_kf': 'table_01_d'}

import boto3  # glueContext and args come from the standard Glue job setup

client = boto3.client('glue', region_name='us-east-1')

databaseName = 'incomingdata'
tables = client.get_tables(DatabaseName=databaseName)
tableList = tables['TableList']
start_prefix = 'client_historical_'
# get_tables returns a list of dicts; keep only the names with the prefix
tableNames = [t['Name'] for t in tableList if t['Name'].startswith(start_prefix)]

for table in tableNames:
    # Look up the Redshift target table for this catalog table
    # (returns None if a table isn't in the mapping)
    target_table = mapping.get(table)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "incomingdata", table_name = table, transformation_ctx = "datasource0")
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource0, catalog_connection = "Redshift", connection_options = {"dbtable": target_table, "database": "metadata"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
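
One caveat: get_tables returns paginated results, so a single call may miss tables if the database has many. A minimal sketch using boto3's get_tables paginator, reusing the same client and variables as above:

paginator = client.get_paginator('get_tables')
tableNames = []
for page in paginator.paginate(DatabaseName=databaseName):
    # Each page carries its own TableList; accumulate matching names
    tableNames += [t['Name'] for t in page['TableList'] if t['Name'].startswith(start_prefix)]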


3 Comments

I am getting an error in Glue when I try this: raise ConnectTimeoutError(endpoint_url=request.url, error=e) botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "glue.us-east-1.amazonaws.com". Seems like boto3 isn't working as expected?
Boto3 doesn't work in the Spark shell of Glue jobs. Alternatively, you can create a Lambda that gets the table list and then calls the Glue job from that Lambda, passing the table list as a parameter (see the sketch after these comments).
I need to use Glue unfortunately because the Data Catalog needs to be set up. :( Is the alternative to make single scripts?
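
A minimal sketch of that Lambda approach, assuming a Glue job named load_client_historical that reads a --table_names argument; both names are hypothetical and would need to match your setup:

import json
import boto3

glue = boto3.client('glue', region_name='us-east-1')

def lambda_handler(event, context):
    # boto3 works here, so fetch and filter the catalog tables in the Lambda
    tables = glue.get_tables(DatabaseName='incomingdata')['TableList']
    names = [t['Name'] for t in tables if t['Name'].startswith('client_historical_')]

    # Start the Glue job, passing the table list as a job argument
    glue.start_job_run(
        JobName='load_client_historical',  # hypothetical job name
        Arguments={'--table_names': json.dumps(names)}
    )

Inside the Glue job, the list can then be recovered with getResolvedOptions instead of calling boto3:

import sys
import json
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['TempDir', 'table_names'])
tableNames = json.loads(args['table_names'])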
