
I want to create a UDF for PySpark based on some Java code. The UDF signature is quite similar to a regex match: the first argument comes from the DataFrame, while the second is the same for every row. The problem is that, as with regex, it is time-consuming to parse the pattern on every call, so the parsed form should be cached. In my case the second argument is even heavier to parse than a regex: it is a DSL represented as JSON. How can I do this caching?

One idea is to maintain a static cache on the Java side: generate an ID on the driver, register my parsed JSON in the cache on each worker under that ID, then pass that ID to the UDF so it can access the already-parsed JSON. How can I achieve this? Or maybe there are other variants?
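To illustrate one of the "other variants": if the parsing can happen (or be bridged) on the Python side, a module-level cache inside the UDF itself avoids the ID bookkeeping entirely, since the raw JSON string can serve as the cache key. A minimal sketch, where `dsl_match`, `_get_parsed`, and the `"expected"` key are hypothetical stand-ins for the real DSL logic:

```python
import json

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# One cache per Python worker process; entries survive across rows
# (and across tasks when spark.python.worker.reuse is enabled).
_PARSED_CACHE = {}

def _get_parsed(dsl_json):
    # Parse the DSL at most once per worker, then reuse the result.
    parsed = _PARSED_CACHE.get(dsl_json)
    if parsed is None:
        parsed = json.loads(dsl_json)  # stand-in for the expensive parse
        _PARSED_CACHE[dsl_json] = parsed
    return parsed

def dsl_match(value, dsl_json):
    # Hypothetical evaluation of the parsed DSL against a column value.
    parsed = _get_parsed(dsl_json)
    return value == parsed.get("expected")

dsl_match_udf = F.udf(dsl_match, BooleanType())

# Usage: the second argument is a literal, identical for every row.
# df.withColumn("matched", dsl_match_udf(F.col("text"), F.lit(dsl_json_string)))
```

Keying by the JSON string itself means every worker that sees the string parses it once and caches it locally, with no driver-generated ID. If the parser must stay in Java, the same static-map pattern can live on the JVM side behind a Scala/Java UDF registered via `spark.udf.registerJavaFunction`.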

  • Add some code to this post: some example of input data and the UDF you've written so far. Commented Dec 17, 2024 at 21:43
  • I am assuming the two arguments to your UDF are common across many rows. Here's an approach: partition the DataFrame on those arguments, then use mapPartitions to initialize the regexes or other computations common to that partition once, and process the rows one by one. My answer at stackoverflow.com/a/77033826/3238085 shows how to do ML model initialization, a complex operation, once per partition and then use it to evaluate that partition (see the sketch after this list). Commented Dec 23, 2024 at 16:35
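A hedged sketch of the mapPartitions approach from the comment above; the column names (`text`, `dsl_json`), the sample data, and the DSL evaluation are assumptions for illustration:

```python
import json

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("foo", '{"expected": "foo"}'), ("bar", '{"expected": "foo"}')],
    ["text", "dsl_json"],
)

def process_partition(rows):
    # Parse each distinct DSL at most once per partition, not once per row.
    parsed_by_dsl = {}
    for row in rows:
        parsed = parsed_by_dsl.get(row.dsl_json)
        if parsed is None:
            parsed = json.loads(row.dsl_json)  # stand-in for the heavy parse
            parsed_by_dsl[row.dsl_json] = parsed
        yield Row(text=row.text, matched=(row.text == parsed.get("expected")))

# Repartitioning on the heavy argument keeps each partition down to a few
# distinct DSLs, so the per-partition cache stays small and effective.
result = spark.createDataFrame(
    df.repartition("dsl_json").rdd.mapPartitions(process_partition)
)
result.show()
```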
