OperationalError when using TriggerDagRunOperator

Hi, I’m new to Cloud Composer and am setting it up on composer-3-airflow-2.10.2-build.12. I’ve run into errors when triggering DAGs with the TriggerDagRunOperator.

The stack trace is lengthy, but at its core it looks very similar to this issue: psycopg2.OperationalError: connection to server at "localhost", port 3306 failed: server closed the connection unexpectedly.

Based on the reading and research I’ve done so far, I think the problem is rooted in the setup of my operator, namely that wait_for_completion=True. Here’s the failing task:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

task = TriggerDagRunOperator(
    task_id='task_id',
    trigger_dag_id='task_trigger_dag_id',
    wait_for_completion=True,
    poke_interval=30,
    failed_states=['failed'],
)

I’m relatively new to Airflow/Cloud Composer/GCP generally, so any pointers on where to look or how to fix this would be great! I’ve considered using an ExternalTaskSensor to dodge wait_for_completion (roughly as sketched below), but that adds a good chunk of overhead per triggered DAG, so I’d rather avoid it if possible.
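
For reference, the ExternalTaskSensor workaround I have in mind would look roughly like this (an untested sketch; the DAG and task IDs are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(dag_id='parent_dag', start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False):
    trigger = TriggerDagRunOperator(
        task_id='trigger_child',
        trigger_dag_id='child_dag',
        logical_date='{{ logical_date }}',  # so the sensor below can find the run
        wait_for_completion=False,          # don't hold the worker while waiting
    )
    wait_for_child = ExternalTaskSensor(
        task_id='wait_for_child',
        external_dag_id='child_dag',
        external_task_id=None,              # None = wait for the whole DAG run
        allowed_states=['success'],
        failed_states=['failed'],
        poke_interval=30,
        mode='reschedule',                  # frees the worker slot between pokes
    )
    trigger >> wait_for_child

That’s two tasks per triggered DAG instead of one, which is the overhead I’d like to avoid.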

Hi @jeffreyp14,

Welcome to Google Cloud Community!

In your case, your Airflow database might be under heavy load. Since the TriggerDagRunOperator with wait_for_completion=True triggers a DAG run and then waits for that run to finish, the wait turns into a long-running task that can hold on to worker and database resources.
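
Conceptually, the waiting behaves like the simplified loop below (a sketch of the idea only, not the actual Airflow source): the task keeps running on a worker and re-reads the triggered run’s state from the Airflow metadata database every poke_interval seconds, so the worker slot and a database session stay in use for the entire wait.

import time

from airflow.exceptions import AirflowException

def wait_for_triggered_run(dag_run, poke_interval=30,
                           allowed_states=('success',),
                           failed_states=('failed',)):
    # Simplified illustration of wait_for_completion=True: poll the triggered
    # DagRun's state until it reaches a terminal state.
    while True:
        time.sleep(poke_interval)
        dag_run.refresh_from_db()  # a round trip to the Airflow metadata DB
        if dag_run.state in failed_states:
            raise AirflowException(f'Triggered run ended in state {dag_run.state}')
        if dag_run.state in allowed_states:
            return dag_run.state
        # Otherwise keep waiting: the worker slot and its memory stay occupied.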

To resolve the issue, please review the suggestions in Troubleshooting DAGs and apply the ones that match your scenario.

If the issue persists, I’d recommend reaching out to Google Cloud Support. When reaching out, include detailed information and relevant screenshots of the errors you’ve encountered. This will assist them in diagnosing and resolving your issue more efficiently.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.


For anyone looking for this in the future: the solution in my case was to increase the worker memory. The psycopg2 error led me to think the problem was with the database, but it wasn’t!

Thanks a lot for that feedback. I had the same issue and was out of hypotheses. I had even ruled out an OOM error, since monitoring showed no sign of saturation and I could not find a SIGKILL or anything else pointing to a memory issue in the logs.