
I want to create a UDF for PySpark based on some Java code. The UDF signature is quite similar to a regex match: the first argument comes from the DataFrame, while the second is the same for every row. The problem is that, as with regex, it is time-consuming to parse the pattern on every call, so the parsed form should be cached. In my case the second argument is even heavier to parse than a regex: it is a DSL represented as JSON. How can I do this caching?

One idea is to maintain a static cache on the Java side: generate an ID on the driver, register my parsed JSON in the cache on each worker under that ID, then pass that ID to the UDF so it can access the already-parsed JSON. How can I achieve this? Or maybe there are other variants?
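To illustrate one of the "other variants": if the parsing can happen (or be bridged) on the Python side, a module-level cache inside the UDF itself avoids the ID bookkeeping entirely, since the raw JSON string can serve as the cache key. A minimal sketch, where `dsl_match`, `_get_parsed`, and the `"expected"` key are hypothetical stand-ins for the real DSL logic:

```python
import json

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# One cache per Python worker process; entries survive across rows
# (and across tasks when spark.python.worker.reuse is enabled).
_PARSED_CACHE = {}

def _get_parsed(dsl_json):
    # Parse the DSL at most once per worker, then reuse the result.
    parsed = _PARSED_CACHE.get(dsl_json)
    if parsed is None:
        parsed = json.loads(dsl_json)  # stand-in for the expensive parse
        _PARSED_CACHE[dsl_json] = parsed
    return parsed

def dsl_match(value, dsl_json):
    # Hypothetical evaluation of the parsed DSL against a column value.
    parsed = _get_parsed(dsl_json)
    return value == parsed.get("expected")

dsl_match_udf = F.udf(dsl_match, BooleanType())

# Usage: the second argument is a literal, identical for every row.
# df.withColumn("matched", dsl_match_udf(F.col("text"), F.lit(dsl_json_string)))
```

Keying by the JSON string itself means every worker that sees the string parses it once and caches it locally, with no driver-generated ID. If the parser must stay in Java, the same static-map pattern can live on the JVM side behind a Scala/Java UDF registered via `spark.udf.registerJavaFunction`.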

  • Add some code to this post: some example of input data and the UDF you've written so far. Commented Dec 17, 2024 at 21:43
  • I am assuming the two arguments to your UDF are common across many rows. Here's an approach: partition the DataFrame on those arguments, then use mapPartitions to initialize the regexes or other computations common to that partition once, and process the rows one by one. My answer at stackoverflow.com/a/77033826/3238085 shows how to do ML model initialization, a complex operation, once per partition and then use it to evaluate that partition (see the sketch after this list). Commented Dec 23, 2024 at 16:35
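A hedged sketch of the mapPartitions approach from the comment above; the column names (`text`, `dsl_json`), the sample data, and the DSL evaluation are assumptions for illustration:

```python
import json

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("foo", '{"expected": "foo"}'), ("bar", '{"expected": "foo"}')],
    ["text", "dsl_json"],
)

def process_partition(rows):
    # Parse each distinct DSL at most once per partition, not once per row.
    parsed_by_dsl = {}
    for row in rows:
        parsed = parsed_by_dsl.get(row.dsl_json)
        if parsed is None:
            parsed = json.loads(row.dsl_json)  # stand-in for the heavy parse
            parsed_by_dsl[row.dsl_json] = parsed
        yield Row(text=row.text, matched=(row.text == parsed.get("expected")))

# Repartitioning on the heavy argument keeps each partition down to a few
# distinct DSLs, so the per-partition cache stays small and effective.
result = spark.createDataFrame(
    df.repartition("dsl_json").rdd.mapPartitions(process_partition)
)
result.show()
```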
