In the book "Spark: The Definitive Guide" (currently in early release, so the text may change), the authors advise against using PySpark user-defined functions in Spark:
"Starting up this Python process is expensive but the real cost is in serializing the data to Python. This is costly for two reasons, it is an expensive computation but also once the data enters Python, Spark cannot manage the memory of the worker. This means that you could potentially cause a worker to fail if it becomes resource constrained (because both the JVM and python are competing for memory on the same machine)."
I understand that the competition for a worker node's resources between Python and the JVM can be a serious problem. But doesn't the same apply to the driver? If so, it would be an argument against using PySpark at all. Could anyone please explain what makes the situation different on the driver?