
I want to make all values in an array column of my PySpark DataFrame negative without exploding (!). I tried this UDF but it didn't work:

negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType()))
cast_contracts = cast_contracts \
    .withColumn('forecast_values', negative('forecast_values'))

Can someone help?

Example data frame:

df = sc.parallelize(
    [Row(name='Joe', forecast_values=[1.0, 2.0, 3.0]),
     Row(name='Mary', forecast_values=[4.0, 7.1])]).toDF()
>>> df.show()
    +----+---------------+
    |name|forecast_values|
    +----+---------------+
    | Joe|[1.0, 2.0, 3.0]|
    |Mary|     [4.0, 7.1]|
    +----+---------------+

Thanks

  • negative = func.udf(lambda x: [i * -1 for i in x], T.ArrayType(T.FloatType()))? Commented Oct 22, 2019 at 12:38

2 Answers


I know this is a year-old post, so the solution I'm about to give may not have been an option previously (it's new to Spark 3). If you're using Spark 3.0 and above with the PySpark API, you should consider using the SQL higher-order function transform inside pyspark.sql.functions.expr. Please don't confuse it with PySpark's DataFrame.transform() chaining. At any rate, here is the solution:

df.withColumn("negative", F.expr("transform(forecast_values, x -> x * -1)"))

The only thing you need to make sure of is that the array values are numeric (int or float). This approach is much more efficient than exploding the array or using a UDF.
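For readers without a Spark cluster handy, the element-wise semantics of `transform(forecast_values, x -> x * -1)` can be sketched in plain Python: the lambda is applied to each array element independently, producing a new array of the same length. The sample rows below mirror the question's example data:

```python
# Plain-Python sketch of what transform(forecast_values, x -> x * -1)
# does per row; no explode/groupBy round-trip is needed.
rows = [
    ("Joe", [1.0, 2.0, 3.0]),
    ("Mary", [4.0, 7.1]),
]

def negate_array(values):
    # Spark's higher-order transform applies the lambda to each element
    # and returns a new array of the same length.
    return [x * -1 for x in values]

negated = [(name, negate_array(vals)) for name, vals in rows]
print(negated)  # [('Joe', [-1.0, -2.0, -3.0]), ('Mary', [-4.0, -7.1])]
```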


2 Comments

available from 2.4
A more Pythonic way (Spark 3.1+) would be: df.withColumn("negative", F.transform(F.col('forecast_values'), lambda x: x * -1))

It's just that you're not looping over the list values to multiply them by -1:

import pyspark.sql.functions as F
import pyspark.sql.types as T

negative = F.udf(lambda x: [i * -1 for i in x], T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))

You cannot escape the UDF, but this is the best possible way to do it. It works better if you have large lists:

import numpy as np

negative = F.udf(lambda x: np.negative(x).tolist(), T.ArrayType(T.FloatType()))
cast_contracts = df \
    .withColumn('forecast_values', negative('forecast_values'))
cast_contracts.show()
+------------------+----+
|   forecast_values|name|
+------------------+----+
|[-1.0, -2.0, -3.0]| Joe|
|      [-4.0, -7.1]|Mary|
+------------------+----+
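Since the UDF body is plain Python, the numpy variant can be sanity-checked locally without a Spark session (the sample input mirrors the question's data):

```python
import numpy as np

# The lambda wrapped by the UDF above, tested outside Spark.
negate = lambda x: np.negative(x).tolist()
print(negate([1.0, 2.0, 3.0]))  # [-1.0, -2.0, -3.0]
print(negate([4.0, 7.1]))       # [-4.0, -7.1]
```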

3 Comments

Thanks. This returns an array of nulls. Maybe my array is an array of strings and I need to convert it to float first. Also it seems that my runtime increased by 12 minutes. Do you reckon this can be due just to the udf?
@LN_P Yes, UDF will spoil your performance but there is no inbuilt functionality to operate on array type columns. How many rows are you working with?
If the array holds strings: negative = F.udf(lambda x: [float(i) * -1 for i in x], T.ArrayType(T.FloatType()))
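The string-array fallback from the comment above can also be checked in plain Python; the lambda below mirrors the UDF body, and the sample values are hypothetical:

```python
# Plain-Python check of the string-array variant: cast each element
# to float before negating, as the UDF body does.
negate_strings = lambda xs: [float(i) * -1 for i in xs]
print(negate_strings(["1.0", "2.0", "3.0"]))  # [-1.0, -2.0, -3.0]
```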
