
I have a BinaryType() column in a PySpark DataFrame which I can convert to an ArrayType() column using the following UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
import numpy as np

@udf(returnType=ArrayType(FloatType()))
def array_from_bytes(raw):
    # interpret the raw bytes as a flat array of float32 values
    return np.frombuffer(raw, np.float32).tolist()

but I wonder if there is a more "Spark-y"/built-in/non-UDF way to convert the types. Is there a general way to get a BinaryType() column into an ArrayType() column? I tried different variations of .cast(), but none of them succeeded.

I'm asking because I have two concerns with the current approach:

  1. I need to know beforehand that frombuffer is the function that needs to be used.
  2. A UDF is probably not optimal from a performance perspective.

The BinaryType() column is created by reading it from a JSON file; in the JSON it is stored as a Base64-encoded string.
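For reference, the Base64/bytes round trip the UDF relies on can be checked outside Spark with plain numpy (a minimal sketch with hypothetical values, not the actual JSON data):

```python
import base64
import numpy as np

# hypothetical float32 values, packed the same way as in the JSON file
vals = np.array([1.0, 2.5, -3.0, 0.5], dtype=np.float32)
encoded = base64.b64encode(vals.tobytes()).decode("ascii")

# what Spark holds after decoding the Base64 string to BinaryType,
# and what the UDF then does with those bytes
decoded = base64.b64decode(encoded)
roundtrip = np.frombuffer(decoded, np.float32).tolist()
print(roundtrip)  # [1.0, 2.5, -3.0, 0.5]
```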

1 Answer

You can check whether a pandas UDF improves the execution time; see the PySpark Usage Guide for Pandas with Apache Arrow.

The PyArrow library needs to be installed, and the following Spark configuration needs to be set:

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

The change required is using pandas_udf as the decorator. Note that, unlike a plain UDF, a pandas UDF receives a pandas Series holding one batch of rows (not a single value), so the function body must operate element-wise over that Series:

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf(returnType=ArrayType(FloatType()))
def array_from_bytes(raw: pd.Series) -> pd.Series:
    # decode each row's bytes into a list of float32 values
    return raw.map(lambda b: np.frombuffer(b, np.float32).tolist())
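The batch semantics can be illustrated outside Spark (a minimal sketch using only pandas and numpy; array_from_bytes_batch is a stand-in name for the undecorated function body above):

```python
import numpy as np
import pandas as pd

def array_from_bytes_batch(raw: pd.Series) -> pd.Series:
    # same body as the pandas_udf: called once per batch of rows,
    # not once per row as with a plain @udf
    return raw.map(lambda b: np.frombuffer(b, np.float32).tolist())

# two rows of packed float32 bytes, as Spark would hand them over in one batch
batch = pd.Series([
    np.array([1.0, 2.0], dtype=np.float32).tobytes(),
    np.array([3.5], dtype=np.float32).tobytes(),
])
print(array_from_bytes_batch(batch).tolist())  # [[1.0, 2.0], [3.5]]
```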
