
I understand that when vectorization is involved, pyspark.sql.functions.pandas_udf is faster than pyspark.sql.functions.udf.

But what if vectorization isn't involved — are the two expected to perform similarly? Is there any guideline for choosing between them?

1 Answer

Pandas UDFs should be faster in most cases, primarily because of the more efficient, Arrow-based encoding of data between the Spark JVM and the Python process, so it's recommended to use Pandas UDFs whenever possible.
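To make the difference concrete, here is a hedged sketch of the same computation written both ways (the column name, function names, and the computation itself are invented for illustration; the Spark wiring is commented out because it needs a running SparkSession):

```python
import pandas as pd

# Row-at-a-time logic: Spark calls this once per value, pickling each
# value across the JVM/Python boundary.
def plus_one(x):
    return float(x) + 1.0

# Vectorized logic: as a Pandas UDF, Spark calls this once per Arrow
# batch, so data crosses the boundary in columnar Arrow buffers.
def plus_one_vec(s: pd.Series) -> pd.Series:
    return s + 1.0

# Hypothetical Spark wiring (requires an active SparkSession and a
# DataFrame `df` with a numeric column "x"):
# from pyspark.sql.functions import udf, pandas_udf
# from pyspark.sql.types import DoubleType
# plus_one_udf  = udf(plus_one, DoubleType())
# plus_one_pudf = pandas_udf(plus_one_vec, DoubleType())
# df.select(plus_one_udf("x"), plus_one_pudf("x"))
```

The per-batch serialization is what makes the Pandas variant cheaper even when the Python-side work is trivial.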

The "normal" UDFs are still useful where Pandas UDFs can't be applied — for example, right now they don't support MapType, arrays of TimestampType, or nested StructType.
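For instance, a function that builds a per-row dictionary needs a MapType result, which per the limitation above forces a plain UDF (the function and column names here are invented; the Spark wiring is commented out since it needs a live session):

```python
# Hypothetical example: count word occurrences in a string column,
# returning a map of word -> count for each row.
def word_counts(text):
    counts = {}
    for word in (text or "").split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical Spark wiring (requires an active SparkSession and a
# DataFrame `df` with a string column "text"):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import MapType, StringType, IntegerType
# word_counts_udf = udf(word_counts, MapType(StringType(), IntegerType()))
# df.select(word_counts_udf("text"))
```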

P.S. When using PySpark, it may also make sense to evaluate Koalas. In my own tests, Koalas was roughly 2x faster than equivalent code using Pandas UDFs, although carefully written PySpark code was still faster.
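Since Koalas mirrors the pandas API (it has since been folded into Spark itself as pyspark.pandas), code written against pandas transfers almost unchanged. The snippet below is plain pandas, with the hypothetical Koalas equivalent noted in comments:

```python
import pandas as pd

# With Koalas the only change would be the import and constructor, e.g.:
#   import databricks.koalas as ks   # or, in Spark 3.2+: import pyspark.pandas as ps
#   df = ks.DataFrame({"x": [1, 2, 3]})
df = pd.DataFrame({"x": [1, 2, 3]})
df["y"] = df["x"] + 1  # under Koalas this runs as a distributed Spark job
print(df["y"].tolist())
```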
