
I have a BinaryType() column in a PySpark DataFrame which I can convert to an ArrayType() column using the following UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
import numpy as np

@udf(returnType=ArrayType(FloatType()))
def array_from_bytes(raw):
    # interpret the raw bytes as a flat array of float32 values
    return np.frombuffer(raw, np.float32).tolist()

but I wonder if there is a more "Spark-y"/built-in/non-UDF way to convert the types. Is there a general way to get a BinaryType() column into an ArrayType() column? I tried different variations of .cast(), but none of them succeeded.

I'm asking because I have two concerns with the current approach:

  1. I need to know beforehand that frombuffer is the function that needs to be used.
  2. A UDF is probably not optimal from a performance perspective.

The BinaryType() column is created by reading it from a JSON file; in the JSON it is stored as a Base64-encoded string.
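For reference, the Base64/bytes round trip the UDF relies on can be checked outside Spark with plain numpy (a minimal sketch with hypothetical values, not the actual JSON data):

```python
import base64
import numpy as np

# hypothetical float32 values, packed the same way as in the JSON file
vals = np.array([1.0, 2.5, -3.0, 0.5], dtype=np.float32)
encoded = base64.b64encode(vals.tobytes()).decode("ascii")

# what Spark holds after decoding the Base64 string to BinaryType,
# and what the UDF then does with those bytes
decoded = base64.b64decode(encoded)
roundtrip = np.frombuffer(decoded, np.float32).tolist()
print(roundtrip)  # [1.0, 2.5, -3.0, 0.5]
```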

1 Answer

You can check whether a pandas UDF improves the execution time; see the PySpark Usage Guide for Pandas with Apache Arrow.

The PyArrow library needs to be installed, and the following Spark configuration needs to be set:

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

The change required is using pandas_udf as the decorator. Note that, unlike a plain UDF, a pandas UDF receives a pandas Series holding one batch of rows (not a single value), so the function body must operate element-wise over that Series:

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf(returnType=ArrayType(FloatType()))
def array_from_bytes(raw: pd.Series) -> pd.Series:
    # decode each row's bytes into a list of float32 values
    return raw.map(lambda b: np.frombuffer(b, np.float32).tolist())
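The batch semantics can be illustrated outside Spark (a minimal sketch using only pandas and numpy; array_from_bytes_batch is a stand-in name for the undecorated function body above):

```python
import numpy as np
import pandas as pd

def array_from_bytes_batch(raw: pd.Series) -> pd.Series:
    # same body as the pandas_udf: called once per batch of rows,
    # not once per row as with a plain @udf
    return raw.map(lambda b: np.frombuffer(b, np.float32).tolist())

# two rows of packed float32 bytes, as Spark would hand them over in one batch
batch = pd.Series([
    np.array([1.0, 2.0], dtype=np.float32).tobytes(),
    np.array([3.5], dtype=np.float32).tobytes(),
])
print(array_from_bytes_batch(batch).tolist())  # [[1.0, 2.0], [3.5]]
```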
