
I am new to the AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using Terraform to create S3, Redshift, and the other supporting resources. For loading the data I am using a Lambda function that is triggered when the Redshift cluster is up. The Lambda function contains the code to copy the data from S3 to Redshift. Currently the process seems to work fine, but the amount of data is still low.

My questions are:

  1. This approach seems to work right now, but I don't know how it will behave once the volume of data increases, and what happens if the Lambda function times out?
  2. Can someone please suggest an alternate way of handling this scenario, even if it can be handled without Lambda? One alternative I came across while searching this topic is AWS Data Pipeline.

Thank you

2 Answers


A serverless approach I've recommended clients move to in this case is the Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion, and if that is all you need to do then you're done.
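A minimal sketch of this fire-and-forget pattern with boto3 follows; the cluster identifier, database, user, table, S3 path, and IAM role ARN are all placeholder assumptions, not values from the question:

```python
import boto3

def lambda_handler(event, context):
    """Issue a COPY via the Redshift Data API and return immediately.

    All identifiers below are placeholders -- substitute your own
    cluster, database, user, table, bucket, and IAM role ARN.
    """
    client = boto3.client("redshift-data")

    copy_sql = """
        COPY my_schema.my_table
        FROM 's3://my-emr-output-bucket/data/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
        FORMAT AS PARQUET;
    """

    # execute_statement is asynchronous: it queues the SQL and returns
    # a statement Id without waiting for the COPY to finish, so the
    # Lambda exits well inside its timeout.
    response = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )
    return {"statement_id": response["Id"]}
```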

If you need to take additional actions after the COPY, then you need a polling Lambda that checks when the COPY completes; the Redshift Data API enables this as well. Once the COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:

  1. launches the first Lambda (initiates the COPY)
  2. has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
  3. launches the additional-actions Lambda once the status checker reports that the COPY completed

The Step Function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and the Step Function as one unit.
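As an illustration, the status-checker Lambda from step 2 could poll describe_statement and report back to the state machine. This is only a sketch: the statement_id key assumes the Step Function passes the first Lambda's output through, which is a choice of this example rather than anything fixed by the API:

```python
import boto3

def lambda_handler(event, context):
    """Report the status of a previously issued Data API statement.

    Assumes the Step Function forwards the statement Id from the
    first Lambda's output as event["statement_id"] (a placeholder
    convention for this sketch).
    """
    client = boto3.client("redshift-data")
    desc = client.describe_statement(Id=event["statement_id"])
    status = desc["Status"]  # SUBMITTED | PICKED | STARTED | FINISHED | ABORTED | FAILED

    if status == "FAILED":
        # Surface the Redshift error so the state machine fails loudly.
        raise RuntimeError(f"COPY failed: {desc.get('Error', 'unknown error')}")

    # A Choice state in the Step Function loops back to a Wait state
    # on anything other than FINISHED.
    return {"statement_id": event["statement_id"], "status": status}
```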


2 Comments

What if the COPY command lasts more than 15 minutes and the Lambda times out? How can we solve this problem?
Sounds like you may not be using the Redshift Data API - docs.aws.amazon.com/redshift/latest/mgmt/data-api.html - or you are using it in a way I'm not expecting. Your Lambda only needs to issue the SQL (COPY), and the Redshift Data API will see it through to completion. You can use a service like Step Functions to poll for completion (successful or unsuccessful).
  1. With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement an alternative solution in the meantime.
  2. I wouldn't recommend Data Pipeline, as it adds overhead (it will start an EC2 instance to run your commands). Your problem is simply the timeout, so you could use either ECS Fargate or a Glue Python Shell job. Either of them can be triggered by a CloudWatch Event fired on an S3 event.

a. Using ECS Fargate, you'll have to take care of the Docker image and set up the ECS infrastructure, i.e. a Task Definition and a Cluster (simple for Fargate).
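For instance, the S3-triggered event (or a thin Lambda behind it) could start the Fargate task with run_task. A hedged sketch; the cluster name, task definition, subnet, and security group are all placeholders:

```python
import boto3

def start_copy_task(event, context):
    """Kick off a Fargate task that runs the COPY; the task itself
    has no 15-minute limit. All resource names are placeholders."""
    ecs = boto3.client("ecs")
    ecs.run_task(
        cluster="data-load-cluster",
        taskDefinition="s3-to-redshift-copy",  # latest active revision
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
    )
```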

b. Using a Glue Python Shell job, you'll simply have to deploy your Python script to S3 (along with the required packages as wheel files) and link those files in the job configuration.
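The job's script can then connect to the cluster directly and run the COPY synchronously, since it isn't bound by Lambda's 15-minute cap. A sketch, assuming psycopg2 is shipped as one of the wheel files and that every identifier and connection detail below is a placeholder:

```python
import psycopg2  # supplied to the Glue job as a wheel file

# All connection details and identifiers below are placeholders.
conn = psycopg2.connect(
    host="my-redshift-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",  # better: fetch from AWS Secrets Manager
)

copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-emr-output-bucket/data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS PARQUET;
"""

# Runs synchronously: the script simply blocks until the COPY
# finishes, which is fine here because Glue's timeout is measured
# in days, not minutes.
with conn, conn.cursor() as cur:  # the with-block commits on success
    cur.execute(copy_sql)
conn.close()
```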

Both of these options are serverless, and you may choose one based on ease of deployment and your comfort level with Docker.

ECS doesn't have any timeout limit, while the timeout limit for a Glue job is 2 days.

Note: To trigger an AWS Glue job from a CloudWatch Event, you'll have to use a Lambda function, as CloudWatch Events doesn't support starting a Glue job as a target yet.
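A minimal sketch of that bridging Lambda; the job name, the argument key, and the event shape are assumptions that depend on how the S3 notification is delivered:

```python
import boto3

def lambda_handler(event, context):
    """Bridge the CloudWatch/S3 event to Glue, since the event rule
    cannot start a Glue job directly. The job name and argument key
    are placeholders; the event shape varies by delivery path, so
    adjust the key lookup accordingly."""
    glue = boto3.client("glue")
    run = glue.start_job_run(
        JobName="s3-to-redshift-copy",
        # Pass the triggering object through to the script if useful.
        Arguments={"--s3_key": event.get("detail", {}).get("object", {}).get("key", "")},
    )
    return {"job_run_id": run["JobRunId"]}
```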

Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
