I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub, with micro-batching performed every 5 seconds. Since data doesn't arrive consistently, I see many Spark batches with 0 records, which still take processing time and end up queued. Is there a way in Data Fusion to configure the pipeline so that a micro-batch with no data doesn't trigger a Spark job, and a Spark job is triggered only when the micro-batch contains data?
[Screenshot: Google Cloud Dataproc Spark UI]
