
In the book "Spark: The Definitive Guide" (currently in early release, so the text might change), the authors advise against using PySpark for user-defined functions in Spark:

"Starting up this Python process is expensive but the real cost is in serializing the data to Python. This is costly for two reasons, it is an expensive computation but also once the data enters Python, Spark cannot manage the memory of the worker. This means that you could potentially cause a worker to fail if it becomes resource constrained (because both the JVM and python are competing for memory on the same machine)."

I understand that the competition for worker-node resources between Python and the JVM can be a serious problem. But doesn't the same apply to the driver? In that case, it would be an argument against using PySpark at all. Could anyone please explain what makes the situation different on the driver?
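
For concreteness, this is the kind of row-at-a-time Python UDF the passage is talking about (a minimal sketch with purely illustrative names and data, not an example from the book):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()
    df = spark.range(1000)  # single column "id", just for illustration

    @udf(returnType=LongType())
    def plus_one(x):
        # Runs in a separate Python worker process on each executor; every
        # value is serialized from the JVM to Python and back, and the memory
        # used here sits outside Spark's JVM memory management.
        return x + 1

    df.select(plus_one("id")).show(5)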

2 Answers


If anything, this is more an argument against using Python UDFs than against PySpark in general, and, to a lesser extent, a similar argument can be made against native (JVM-implemented) UDFs.

You should also note that vectorized UDFs are on the Spark road map, so:

the real cost is in serializing the data to Python

might no longer be a concern in the future.
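
As a rough sketch of what a vectorized UDF looks like (this assumes a Spark version where pandas_udf is available, pyarrow installed, and an existing SparkSession named spark — those assumptions are mine, not something from the book):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("long")
    def plus_one_vectorized(s: pd.Series) -> pd.Series:
        # Receives whole batches as pandas Series via Arrow, so the
        # per-row serialization cost is amortized.
        return s + 1

    spark.range(1000).select(plus_one_vectorized("id")).show(5)

The data still crosses the JVM/Python boundary, but in columnar Arrow batches rather than one pickled row at a time.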

But doesn't that also apply to the driver?

Not so much. While sharing the resources of a single node is always a concern (think of co-locating additional services), the problem with UDFs is quite specific: the same data has to be stored in two different contexts at the same time.

If you, for example, opt for the RDD API, the JVM serves mostly as a communication layer and the overhead is significantly smaller. It is therefore a much more natural choice for native Python computations, although you may find better-suited, native Python tools out there.
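
A minimal sketch of that pattern (illustrative only; it assumes an existing SparkSession named spark):

    # Pure-Python processing through the RDD API: the JVM mostly ships
    # pickled partitions to the Python workers instead of keeping its own
    # parallel, managed representation of the rows.
    rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)

    def process_partition(values):
        # Arbitrary native Python logic, executed entirely in the Python worker.
        for x in values:
            yield x * x

    print(rdd.mapPartitions(process_partition).sum())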


1 Comment

Yes, the authors refer exclusively to UDFs, and I tried to indicate that in the question. As for your remark on the RDD API: we still store the same data in two different contexts when we use it, don't we?

In your driver application, you don't necessarily have to collect a ton of records. Maybe you're just doing a reduce down to some statistics.

This is just the typical pattern, though: drivers usually deal with statistical results. Your mileage may vary.
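
As a sketch of that pattern (the path and the value column are made up, and it assumes an existing SparkSession named spark), only a single aggregated row ever reaches the driver here:

    from pyspark.sql import functions as F

    df = spark.read.parquet("/data/events")  # hypothetical dataset

    # The scan and aggregation run on the executors; the driver only
    # receives one small row of statistics.
    stats = df.agg(F.count("*").alias("n"), F.avg("value").alias("avg_value")).collect()
    print(stats[0]["n"], stats[0]["avg_value"])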

On the other hand, Spark applications typically use the executors to read in as much data as their memory allows and process it. So memory management is almost always a concern.

I think this is the distinction the book is getting at.

1 Comment

When I first read that passage in the book, I had a similar thought. But it's this very argument that often leads to the decision to use a smaller machine as the driver. It seems to me that it's not so much a question of how much memory is used (of course, that's important, too), but of whether I'm able to control the memory usage and prevent the machines in the cluster from becoming resource-constrained.
