I need to implement a dynamic "bring-your-own-code" mechanism for registering UDFs that are defined outside my own codebase. The setup is containerized, and the entrypoint is a standard Python interpreter (not pyspark). Based on config settings at startup, the Spark container would initialize itself with something like the function below. We don't know the function definitions ahead of time, but we can pre-install any dependencies on the container if needed.
def register_udf_module(udf_name, zip_or_py_path, file_name, function_name, return_type="int"):
    # Pseudocode:
    global sc, spark
    sc.addPyFile(zip_or_py_path)
    module_ref = some_inspect_function_1(zip_or_py_path)  # <- the piece I'm missing
    file_ref = module_ref[file_name]
    function_ref = file_ref[function_name]
    spark.udf.register(udf_name, function_ref, return_type)
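Here is my best guess at what that missing lookup might look like, untested. It assumes that after sc.addPyFile the .py or .zip ends up on the driver's sys.path, so a plain importlib.import_module by module name works; I've swapped file_name for a module_name parameter since the import is by module name rather than file path, and all example names here are made up.

import importlib

def register_udf_module(udf_name, zip_or_py_path, module_name, function_name, return_type="int"):
    global sc, spark
    # Distribute the .py/.zip to the executors; addPyFile should also place the
    # file on the driver's sys.path so a normal import works afterwards.
    sc.addPyFile(zip_or_py_path)
    # Import by module name, e.g. "my_udfs" for my_udfs.py (or a top-level
    # module inside the zip), then grab the function by name.
    module_ref = importlib.import_module(module_name)
    function_ref = getattr(module_ref, function_name)
    # Register on the session so the function is callable from SQL; a DDL type
    # string like "int" or a pyspark.sql.types.DataType works as the return type.
    spark.udf.register(udf_name, function_ref, return_type)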
I can't find any reference for how to accomplish this, or confirmation that the approach above would work. Specifically, the use case is that after this code initializes the Spark cluster, users need the UDF to be available to SQL queries issued over a Thrift JDBC connection. As far as I know there is no interface between the JDBC/SQL connection and UDF registration, so the UDF has to be registered and ready before any SQL queries arrive; I can't expect users to later call spark.udf.register on their side.
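For completeness, the startup flow I have in mind would be driven by a config file read at container start, roughly like the below (the config path, keys, and example values are all made up for illustration):

import json

with open("/app/udf_config.json") as fh:
    for entry in json.load(fh):
        register_udf_module(
            udf_name=entry["udf_name"],            # e.g. "square_it"
            zip_or_py_path=entry["path"],          # e.g. "/deps/my_udfs.zip"
            module_name=entry["module"],           # e.g. "my_udfs"
            function_name=entry["function"],       # e.g. "squared"
            return_type=entry.get("return_type", "int"),
        )

My understanding (which may be wrong) is that registrations like this are session-scoped, so they would only be visible over JDBC if the Thrift server shares the SparkSession that ran the registration (e.g. one started in-process via HiveThriftServer2.startWithContext) rather than a separately launched Thrift server. Confirming or correcting that assumption is part of what I'm asking.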