
I have a PySpark job reading ~50-55 GB of Parquet data from a Delta table. The job runs on GCP with n2-highmem-4 worker VMs and autoscaling between 1 and 15 workers. Each n2-highmem-4 worker has 32 GB of memory and 4 cores, and each VM runs a single executor with 22 GB allocated to it, i.e. 22 GB × 15 = 330 GB of total executor memory, which seems large enough for ~55 GB of input data. The shuffle partition count is set to 200. Yet I'm getting an OOM error.
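For context, a minimal sketch of the setup described above; the app name, table path, and the use of Spark dynamic allocation to model the 1-15 worker autoscaling are assumptions for illustration, not taken from the actual job:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delta-read-job")                        # hypothetical app name
        .config("spark.executor.memory", "22g")           # 22 GB per executor, as stated above
        .config("spark.executor.cores", "4")              # 4 cores per n2-highmem-4 worker
        .config("spark.sql.shuffle.partitions", "200")    # shuffle partition count from the question
        .config("spark.dynamicAllocation.enabled", "true")    # assumed mechanism for 1-15 worker scaling
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "15")
        .getOrCreate()
    )

    # Placeholder path; reading a Delta table also requires the delta-spark package on the cluster.
    df = spark.read.format("delta").load("gs://bucket/path/to/table")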

With a shuffle partition count of 200 and 15 workers, that is roughly ~13 partitions (200 / 15) per executor. Since each executor (i.e. each VM) has 4 cores, 4 tasks run in parallel. A back-of-the-envelope version of this arithmetic is sketched below.
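The sketch below just restates the arithmetic from the paragraph above; it assumes the input is spread evenly across the 200 shuffle partitions and does not account for decompression/deserialization overhead or skew:

    # Back-of-the-envelope numbers from the question (pure arithmetic, not a measurement).
    input_gb = 55              # compressed Parquet input size
    shuffle_partitions = 200
    executors = 15
    cores_per_executor = 4

    partitions_per_executor = shuffle_partitions / executors            # ~13.3 partitions per executor
    per_partition_gb = input_gb / shuffle_partitions                    # ~0.28 GB per partition if evenly spread
    concurrent_gb_per_executor = per_partition_gb * cores_per_executor  # ~1.1 GB held by the 4 running tasks

    print(partitions_per_executor, per_partition_gb, concurrent_gb_per_executor)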

Only a single stage failed; the task details from the failed job are shown below.

[screenshot: task details for the failed stage]

Could you please help me understand why this is not sufficient and leads to an OOM? Also, do all ~13 partitions assigned to an executor need to fit in memory at once, or, since only 4 tasks run in parallel per executor, is it enough for memory to accommodate just 4 partitions at a time?

  • Please provide the query and a code sample showing what is causing the issue and where the OOM is happening; without that it's just guesswork, e.g. whether you are using .collect(). Commented May 5 at 16:23
  • The codebase is large, so I'm unable to share it here, sorry. However, I'm not using collect() or any other known anti-patterns. I'm primarily looking to understand the potential reasons for the OOM based on the data size and cluster configuration shared above, for general insight and troubleshooting. Commented May 5 at 16:26
