
I understand that when vectorization is involved, pyspark.sql.functions.pandas_udf is faster than pyspark.sql.functions.udf.

But what if vectorization isn't involved — are the two expected to perform similarly? Is there any guideline for choosing between them?

1 Answer

Pandas UDFs should be faster in most cases, primarily because of the more efficient, Arrow-based encoding of data between the Spark JVM and the Python process, so it's recommended to use Pandas UDFs whenever possible.
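To make the difference concrete, here is a hedged sketch of the same computation written both ways (the column name, function names, and the computation itself are invented for illustration; the Spark wiring is commented out because it needs a running SparkSession):

```python
import pandas as pd

# Row-at-a-time logic: Spark calls this once per value, pickling each
# value across the JVM/Python boundary.
def plus_one(x):
    return float(x) + 1.0

# Vectorized logic: as a Pandas UDF, Spark calls this once per Arrow
# batch, so data crosses the boundary in columnar Arrow buffers.
def plus_one_vec(s: pd.Series) -> pd.Series:
    return s + 1.0

# Hypothetical Spark wiring (requires an active SparkSession and a
# DataFrame `df` with a numeric column "x"):
# from pyspark.sql.functions import udf, pandas_udf
# from pyspark.sql.types import DoubleType
# plus_one_udf  = udf(plus_one, DoubleType())
# plus_one_pudf = pandas_udf(plus_one_vec, DoubleType())
# df.select(plus_one_udf("x"), plus_one_pudf("x"))
```

The per-batch serialization is what makes the Pandas variant cheaper even when the Python-side work is trivial.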

The "normal" UDFs are still useful where Pandas UDFs can't be applied — for example, right now they don't support MapType, arrays of TimestampType, or nested StructType.
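For instance, a function that builds a per-row dictionary needs a MapType result, which per the limitation above forces a plain UDF (the function and column names here are invented; the Spark wiring is commented out since it needs a live session):

```python
# Hypothetical example: count word occurrences in a string column,
# returning a map of word -> count for each row.
def word_counts(text):
    counts = {}
    for word in (text or "").split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical Spark wiring (requires an active SparkSession and a
# DataFrame `df` with a string column "text"):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import MapType, StringType, IntegerType
# word_counts_udf = udf(word_counts, MapType(StringType(), IntegerType()))
# df.select(word_counts_udf("text"))
```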

P.S. When using PySpark, it may also make sense to evaluate Koalas. In my own tests, Koalas was roughly 2x faster than equivalent code using Pandas UDFs, although carefully written PySpark code was still faster.
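Since Koalas mirrors the pandas API (it has since been folded into Spark itself as pyspark.pandas), code written against pandas transfers almost unchanged. The snippet below is plain pandas, with the hypothetical Koalas equivalent noted in comments:

```python
import pandas as pd

# With Koalas the only change would be the import and constructor, e.g.:
#   import databricks.koalas as ks   # or, in Spark 3.2+: import pyspark.pandas as ps
#   df = ks.DataFrame({"x": [1, 2, 3]})
df = pd.DataFrame({"x": [1, 2, 3]})
df["y"] = df["x"] + 1  # under Koalas this runs as a distributed Spark job
print(df["y"].tolist())
```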
