
In the book "Spark: The Definitive Guide" (currently in early release, so the text might change), the authors advise against using PySpark for user-defined functions in Spark:

"Starting up this Python process is expensive but the real cost is in serializing the data to Python. This is costly for two reasons, it is an expensive computation but also once the data enters Python, Spark cannot manage the memory of the worker. This means that you could potentially cause a worker to fail if it becomes resource constrained (because both the JVM and python are competing for memory on the same machine)."

I understand that the competition for worker-node resources between Python and the JVM can be a serious problem. But doesn't the same apply to the driver? In that case, it would be an argument against using PySpark at all. Could anyone please explain what makes the situation different on the driver?
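
For concreteness, this is the kind of row-at-a-time Python UDF the passage is talking about (a minimal sketch with purely illustrative names and data, not an example from the book):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()
    df = spark.range(1000)  # single column "id", just for illustration

    @udf(returnType=LongType())
    def plus_one(x):
        # Runs in a separate Python worker process on each executor; every
        # value is serialized from the JVM to Python and back, and the memory
        # used here sits outside Spark's JVM memory management.
        return x + 1

    df.select(plus_one("id")).show(5)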

2 Answers


If anything, this is more an argument against using Python UDFs than against PySpark in general, and, to a lesser extent, a similar argument can be made against native (JVM-implemented) UDFs.

You should also note that vectorized UDFs are on the Spark road map, so:

the real cost is in serializing the data to Python

might no longer be a concern in the future.
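
As a rough sketch of what a vectorized UDF looks like (this assumes a Spark version where pandas_udf is available, pyarrow installed, and an existing SparkSession named spark — those assumptions are mine, not something from the book):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("long")
    def plus_one_vectorized(s: pd.Series) -> pd.Series:
        # Receives whole batches as pandas Series via Arrow, so the
        # per-row serialization cost is amortized.
        return s + 1

    spark.range(1000).select(plus_one_vectorized("id")).show(5)

The data still crosses the JVM/Python boundary, but in columnar Arrow batches rather than one pickled row at a time.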

But doesn't that also apply to the driver?

Not so much. While sharing the resources of a single node is always a concern (think of co-locating additional services), the problem with UDFs is quite specific: the same data has to be stored in two different contexts at the same time.

If you, for example, opt for the RDD API, the JVM serves mostly as a communication layer and the overhead is significantly smaller. It is therefore a much more natural choice for native Python computations, although you may find better-suited, native Python tools out there.
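
A minimal sketch of that pattern (illustrative only; it assumes an existing SparkSession named spark):

    # Pure-Python processing through the RDD API: the JVM mostly ships
    # pickled partitions to the Python workers instead of keeping its own
    # parallel, managed representation of the rows.
    rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)

    def process_partition(values):
        # Arbitrary native Python logic, executed entirely in the Python worker.
        for x in values:
            yield x * x

    print(rdd.mapPartitions(process_partition).sum())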


1 Comment

Yes, the authors refer exclusively to UDFs, and I tried to indicate that in the question. As for your remark on the RDD API: we still store the same data in two different contexts when we use it, don't we?

In your driver application, you don't necessarily have to collect a ton of records. Maybe you're just doing a reduce down to some statistics.

This is just the typical pattern, though: drivers usually deal with statistical results. Your mileage may vary.
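
As a sketch of that pattern (the path and the value column are made up, and it assumes an existing SparkSession named spark), only a single aggregated row ever reaches the driver here:

    from pyspark.sql import functions as F

    df = spark.read.parquet("/data/events")  # hypothetical dataset

    # The scan and aggregation run on the executors; the driver only
    # receives one small row of statistics.
    stats = df.agg(F.count("*").alias("n"), F.avg("value").alias("avg_value")).collect()
    print(stats[0]["n"], stats[0]["avg_value"])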

On the other hand, Spark applications typically use the executors to read in as much data as their memory allows and process it. So memory management is almost always a concern.

I think this is the distinction the book is getting at.

1 Comment

When I first read that passage in the book, I had a similar thought. But it's this very argument that often leads to the decision to use a smaller machine as the driver. It seems to me that it's not so much a question of how much memory is used (of course, that's important, too), but of whether I'm able to control the memory usage and prevent the machines in the cluster from becoming resource-constrained.
