I wonder if Apache Beam/Google Dataflow is smart enough to recognize repeated transformations in the pipeline graph and run them only once. For example, if I have two branches:

  • p | GroupByKey() | FlatMap(...)
  • p | combiners.Top.PerKey(...) | FlatMap(...)

Both involve grouping elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually ensure that a single GroupByKey() precedes all branches where it is used?

1 Answer

As you may have inferred, this behavior is runner-dependent. Each runner implements its own optimization logic.

  • The Dataflow Runner does not currently support this optimization.

4 Comments

Thank you, Pablo. May I ask if this optimization is on the roadmap for the Dataflow runner?
@00111100b, you might want to star this Feature Request to help prioritize it, and perhaps open your own request for anything else you need.
@Temu thank you. I am getting "access denied" when trying to follow your link. Is that expected? I am logged in with my Gmail account.
You are right, I sent you the wrong link. Here is the correct one.