I have a BinaryType() column in a PySpark DataFrame which I can convert to an ArrayType() column using the following UDF:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

@udf(returnType=ArrayType(FloatType()))
def array_from_bytes(raw):
    return np.frombuffer(raw, np.float32).tolist()
but I wonder if there is a more "Spark-y"/built-in/non-UDF way to convert the types. Is there a general way to turn a BinaryType() column into an ArrayType() column? I tried different variations of .cast(), but none of them succeeded.
I'm asking because I have two concerns with the current approach:
- I need to know beforehand that "frombuffer" is the right function to use
- the UDF is probably not optimal from a performance perspective(?)
The BinaryType() column is created by reading from a JSON file; in the JSON, the data is stored as a Base64-encoded string.
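For context, here is a minimal sketch (plain Python/NumPy, outside of Spark, with made-up sample values) of the roundtrip my data goes through: float32 values are packed into bytes and Base64-encoded in the JSON, and the UDF reverses this with np.frombuffer:

```python
import base64
import numpy as np

# Hypothetical sample: four float32 values packed into raw bytes,
# then Base64-encoded the way they appear in the JSON file.
values = np.array([1.0, 2.5, -3.0, 0.5], dtype=np.float32)
encoded = base64.b64encode(values.tobytes()).decode("ascii")

# Decoding reverses the steps: Base64 -> raw bytes -> float32 array,
# which is exactly what np.frombuffer does inside the UDF.
decoded = np.frombuffer(base64.b64decode(encoded), dtype=np.float32)
print(decoded.tolist())  # [1.0, 2.5, -3.0, 0.5]
```

So the question is essentially whether this last decoding step (bytes to float32 array) can be expressed with built-in Spark column functions instead of a Python UDF.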