I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub, with micro-batching performed every 5 seconds. Because data doesn’t arrive consistently, I see many Spark batches with 0 records, which still take processing time and end up queued. Is there a way to configure the pipeline in Data Fusion so that a micro-batch with no data doesn't trigger a Spark job, and a job is triggered only when the micro-batch contains data?

Google Cloud Dataproc Spark UI: [screenshot not included]
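
To make the behavior I'm after concrete, here is a minimal sketch in plain Python. It is not Data Fusion or Spark code: `run_micro_batches` and `process` are hypothetical names made up for illustration, and the lists stand in for 5-second micro-batches. In raw Spark Streaming I believe the equivalent guard would be an `rdd.isEmpty()` check inside `foreachRDD`, but I don't know whether Data Fusion exposes a setting for this.

```python
from typing import Callable, Iterable, List


def run_micro_batches(
    batches: Iterable[List[str]],
    process: Callable[[List[str]], None],
) -> int:
    """Run `process` only for non-empty batches; return how many jobs ran."""
    jobs_triggered = 0
    for batch in batches:
        if not batch:
            # Empty micro-batch: skip it instead of scheduling a job.
            continue
        process(batch)
        jobs_triggered += 1
    return jobs_triggered


# Five simulated micro-batches, three of them empty.
processed: List[str] = []
batches = [[], ["a", "b"], [], [], ["c"]]
jobs = run_micro_batches(batches, processed.extend)
print(jobs, processed)  # prints: 2 ['a', 'b', 'c']
```

With this guard, only the two batches that contain records would result in a job; the three empty ones are skipped rather than queued.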
