
I'm solving a problem using step function workflows.

The problem goes like this: I have a workflow of 10 AWS Batch jobs.

The first 3 jobs run in sequence, and jobs 4-7 are dynamic steps, i.e., they need to run multiple times with different parameters as specified.

And for each 4-5-6-7 execution, there are multiple executions of jobs 8-9-10, based on the number of parameters.

A Map state looks like the best fit here, but if any job fails inside the 4-5-6-7 Map state, the entire step fails. I don't want one execution to affect the others.

Approach: I have designed 3 step functions. The first step function runs jobs 1-3, and its last step calls a Lambda function that submits multiple executions of jobs 4-5-6-7. Then, for each 4-5-6-7 execution, another Lambda is triggered to submit multiple executions of jobs 8-9-10.

I'm connecting the step functions together manually through Lambda functions.

Is this the correct approach or are there better ways of doing it?

1 Answer


I'd suggest a couple more elements to make your solution more production-ready.

First, I would suggest that you eliminate the Lambda function calls and use the "Run a Job" (.sync:2) service integration for Nested workflows. I just did a Twitch episode on this yesterday.
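Concretely, the `.sync:2` integration is a Task state that starts a child state machine execution and waits for it to complete, returning the child's output as parsed JSON. A minimal sketch of such a Task state (the state machine ARN and input paths here are placeholders, not values from your setup):

```json
{
  "InvokeChildWorkflow": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync:2",
    "Parameters": {
      "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:ChildWorkflow",
      "Input": {
        "parameters.$": "$.parameters"
      }
    },
    "End": true
  }
}
```

This removes the hand-written Lambda glue: the parent blocks until the child finishes, and a child failure surfaces as a normal error you can retry or catch.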

Second, if you want to continue after a failed execution inside your Map State, make sure that you are implementing Catchers (and optionally Retriers). I did a Twitch episode on this last Tuesday, and there's some discussion of error handling in the first video linked above.
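As a sketch, here is a Retrier and Catcher attached to a Batch task inside a Map iterator. Because the Catcher routes to a Pass state inside the iterator, a failed iteration is recorded without failing its sibling iterations (job names, JSON paths, and retry settings below are illustrative assumptions):

```json
{
  "ProcessParameterSets": {
    "Type": "Map",
    "ItemsPath": "$.parameterSets",
    "Iterator": {
      "StartAt": "RunBatchJob",
      "States": {
        "RunBatchJob": {
          "Type": "Task",
          "Resource": "arn:aws:states:::batch:submitJob.sync",
          "Parameters": {
            "JobName": "job-4",
            "JobQueue.$": "$.jobQueue",
            "JobDefinition.$": "$.jobDefinition"
          },
          "Retry": [
            {
              "ErrorEquals": ["States.TaskFailed"],
              "IntervalSeconds": 30,
              "MaxAttempts": 2,
              "BackoffRate": 2.0
            }
          ],
          "Catch": [
            {
              "ErrorEquals": ["States.ALL"],
              "ResultPath": "$.error",
              "Next": "RecordFailure"
            }
          ],
          "End": true
        },
        "RecordFailure": {
          "Type": "Pass",
          "End": true
        }
      }
    },
    "End": true
  }
}
```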

So for your specific case, I suggest you:

  1. start by making the 8-9-10 steps into an independent workflow (Child A)
  2. invoke Child A from steps 4-5-6-7 via the "Run a Job" service integration inside a Map State
  3. migrate steps 4-5-6-7 into an independent workflow (Child B)
  4. invoke Child B from the parent workflow (steps 1-2-3), again via the "Run a Job" service integration
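Putting steps 1-4 together, the parent workflow might look like the sketch below: jobs 1-3 chained as Batch `.sync` tasks (only job 1 is shown; jobs 2 and 3 chain the same way), followed by a Map that starts one Child B execution per parameter set. All ARNs, job names, and JSON paths are placeholders:

```json
{
  "Comment": "Parent: jobs 1-3 in sequence, then Child B per parameter set",
  "StartAt": "Job1",
  "States": {
    "Job1": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "job-1",
        "JobQueue.$": "$.jobQueue",
        "JobDefinition.$": "$.job1Definition"
      },
      "Next": "RunChildBPerParameterSet"
    },
    "RunChildBPerParameterSet": {
      "Type": "Map",
      "ItemsPath": "$.parameterSets",
      "Iterator": {
        "StartAt": "StartChildB",
        "States": {
          "StartChildB": {
            "Type": "Task",
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
              "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:ChildB",
              "Input": {
                "parameters.$": "$"
              }
            },
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

Child B would follow the same pattern internally, running jobs 4-7 and then mapping over its own parameter sets to start Child A (jobs 8-9-10).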

For more information on parallelism in Step Functions and Lambda functions, see this Twitch episode.

Code samples for all of the above are available in this repo on GitHub.


I am contributing this on behalf of my employer, Amazon. My contribution is licensed under the MIT license. See here for a more detailed explanation.


5 Comments

That's a good suggestion, but in case of a failure in one of the Map states, how do we rerun only the failed states of the step function? Or, in general, how do we rerun failed states without re-executing the states that already succeeded?
This link shows a way to restart a step function from any state: aws.amazon.com/blogs/compute/…. If this is the only option, do we need to rerun the entire Map state?
You add the retrier/catcher inside the Map State Iterator. That way it is only run for iterations that do not succeed.
Suppose there is a bug in the 6th Batch job of the 4-5-6-7 workflow. That job fails even after retries, and presumably we would catch the error. But once we fix the issue, how can we make sure only the failed jobs run, and not all of them?
You could use a catcher to send failed jobs to a dead letter queue.
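As a sketch of that idea, assuming a pre-created SQS queue, a Catcher on the Batch task can route the failing iteration's input to the queue via the `sqs:sendMessage` service integration; after the fix, you resubmit only those queued inputs rather than rerunning the whole Map state (the queue URL, job name, and paths are placeholders):

```json
{
  "RunBatchJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
      "JobName": "job-6",
      "JobQueue.$": "$.jobQueue",
      "JobDefinition.$": "$.jobDefinition"
    },
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "SendToDLQ"
      }
    ],
    "End": true
  },
  "SendToDLQ": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage",
    "Parameters": {
      "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/failed-jobs-dlq",
      "MessageBody.$": "$"
    },
    "End": true
  }
}
```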
