
I have the following dataframe (the values inside the arrays are strings):

+--------------------+--------------------+
|                col1|                col2|
+--------------------+--------------------+
|    [value1, value2]|     [value3,value4]|
|            [value5]|            [value6]|
+--------------------+--------------------+

How can I create a new column containing a new array that includes all the values from both?

+--------------------+--------------------+------------------------------+
|                col1|                col2|                          new |
+--------------------+--------------------+------------------------------+
|    [value1, value2]|     [value3,value4]|[value1, value2,value3,value4]|
|            [value5]|            [value6]|               [value5,value6]|
+--------------------+--------------------+------------------------------+

I tried the following:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def add_function(col1, col2):
    return col1 + col2

udf_add = udf(add_function, ArrayType(StringType()))
dftrial.withColumn("new", udf_add("col1", "col2")).show(2)

It does the task as desired. But I don't understand why, when I modify add_function to:

def add_function(col1, col2):
    return col1.extend(col2)

it returns null values. Why?

And my main question: is there another way to implement this task? Is there an already implemented function? I found concat, but it seems to work only on strings.

2 Answers


Why wouldn't it? Written as a type signature, list.extend is:

list.extend(iterable) -> None

So you get exactly what extend returns: None. If you wanted to return the modified collection, you should return col1 instead, but please don't, because there is an even worse problem here.
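The difference is easy to see in plain Python, outside of Spark (list.extend mutates its receiver and returns None, while + builds a new list):

```python
col1 = ["value1", "value2"]
col2 = ["value3", "value4"]

# extend mutates col1 in place and returns None;
# a UDF returning None is what Spark shows as null
result = col1.extend(col2)
print(result)    # None
print(col1)      # ['value1', 'value2', 'value3', 'value4'] -- col1 was changed

# + builds a NEW list and leaves both inputs untouched
a = ["value1", "value2"]
b = ["value3", "value4"]
print(a + b)     # ['value1', 'value2', 'value3', 'value4']
print(a)         # ['value1', 'value2'] -- unchanged
```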

You should never modify data in place when working with Spark. While you happen to be safe in this particular scenario, it can have unpredictable consequences. You can find a possible example in my answer to Will there be any scenario, where Spark RDD's fail to satisfy immutability?. While PySpark is relatively insulated from behaviors like this, that is only an implementation detail and not something you can depend on in general.


2 Comments

Thanks for your answer. Can you clarify what I should never do? I didn't understand it exactly, and it seems to be an important piece of information that I am missing.
list.extend modifies (mutates) the existing list. Don't do this with your data. Always return a new object unless mutation is explicitly allowed (see RDD.fold, RDD.aggregate, etc.).

I agree with @zero323. I just wanted to add the transformation that would be necessary to get the solution into a new dataframe.

  // requires: import spark.implicits._ (for the map encoder and .toDF)
  val updatedDataframe = initialDataframe.map {
    case Row(col1: Seq[String], col2: Seq[String]) => (col1, col2, col1.union(col2))
  }.toDF("col1", "col2", "col3")

