
I want to make all values in an array column of my PySpark DataFrame negative without exploding (!). I tried this UDF but it didn't work:

negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType()))
cast_contracts = cast_contracts \
    .withColumn('forecast_values', negative('forecast_values'))

Can someone help?

Example data frame:

df = sc.parallelize(
    [Row(name='Joe', forecast_values=[1.0, 2.0, 3.0]),
     Row(name='Mary', forecast_values=[4.0, 7.1])]).toDF()
>>> df.show()
    +----+---------------+
    |name|forecast_values|
    +----+---------------+
    | Joe|[1.0, 2.0, 3.0]|
    |Mary|     [4.0, 7.1]|
    +----+---------------+

Thanks

  • negative = func.udf(lambda x: [i * -1 for i in x], T.ArrayType(T.FloatType()))? Commented Oct 22, 2019 at 12:38

2 Answers


I know this is a year-old post, so the solution I'm about to give may not have been an option previously (it's new to Spark 3). If you're using Spark 3.0 and above with the PySpark API, you should consider using the SQL higher-order function transform inside pyspark.sql.functions.expr. Please don't confuse it with PySpark's DataFrame.transform() chaining. At any rate, here is the solution:

df.withColumn("negative", F.expr("transform(forecast_values, x -> x * -1)"))

The only thing you need to make sure of is that the array values are numeric (int or float). This approach is much more efficient than exploding the array or using a UDF.
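For readers without a Spark cluster handy, the element-wise semantics of `transform(forecast_values, x -> x * -1)` can be sketched in plain Python: the lambda is applied to each array element independently, producing a new array of the same length. The sample rows below mirror the question's example data:

```python
# Plain-Python sketch of what transform(forecast_values, x -> x * -1)
# does per row; no explode/groupBy round-trip is needed.
rows = [
    ("Joe", [1.0, 2.0, 3.0]),
    ("Mary", [4.0, 7.1]),
]

def negate_array(values):
    # Spark's higher-order transform applies the lambda to each element
    # and returns a new array of the same length.
    return [x * -1 for x in values]

negated = [(name, negate_array(vals)) for name, vals in rows]
print(negated)  # [('Joe', [-1.0, -2.0, -3.0]), ('Mary', [-4.0, -7.1])]
```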


2 Comments

available from 2.4
A more Pythonic way (Spark 3.1+) would be: df.withColumn("negative", F.transform(F.col('forecast_values'), lambda x: x * -1))

It's just that you're not looping over the list values to multiply them by -1:

import pyspark.sql.functions as F
import pyspark.sql.types as T

negative = F.udf(lambda x: [i * -1 for i in x], T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))

You cannot escape the UDF, but this is the best possible way to do it. It works better if you have large lists:

import numpy as np

negative = F.udf(lambda x: np.negative(x).tolist(), T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))
cast_contracts.show()
+------------------+----+
|   forecast_values|name|
+------------------+----+
|[-1.0, -2.0, -3.0]| Joe|
|      [-4.0, -7.1]|Mary|
+------------------+----+
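Since the UDF body is plain Python, the numpy variant can be sanity-checked locally without a Spark session (the sample input mirrors the question's data):

```python
import numpy as np

# The lambda wrapped by the UDF above, tested outside Spark.
negate = lambda x: np.negative(x).tolist()
print(negate([1.0, 2.0, 3.0]))  # [-1.0, -2.0, -3.0]
print(negate([4.0, 7.1]))       # [-4.0, -7.1]
```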

3 Comments

Thanks. This returns an array of nulls. Maybe my array is an array of strings and I need to convert it to float first. Also it seems that my runtime increased by 12 minutes. Do you reckon this can be due just to the udf?
@LN_P Yes, UDF will spoil your performance but there is no inbuilt functionality to operate on array type columns. How many rows are you working with?
If the array holds strings: negative = F.udf(lambda x: [float(i) * -1 for i in x], T.ArrayType(T.FloatType()))
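The string-array fallback from the comment above can also be checked in plain Python; the lambda below mirrors the UDF body, and the sample values are hypothetical:

```python
# Plain-Python check of the string-array variant: cast each element
# to float before negating, as the UDF body does.
negate_strings = lambda xs: [float(i) * -1 for i in xs]
print(negate_strings(["1.0", "2.0", "3.0"]))  # [-1.0, -2.0, -3.0]
```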
