
I have the following dataframe (the values inside the arrays are strings):

+--------------------+--------------------+
|                col1|                col2|
+--------------------+--------------------+
|    [value1, value2]|     [value3,value4]|
|            [value5]|            [value6]|
+--------------------+--------------------+

How can I create a new column containing a new array that includes all the values from both?

+--------------------+--------------------+------------------------------+
|                col1|                col2|                          new |
+--------------------+--------------------+------------------------------+
|    [value1, value2]|     [value3,value4]|[value1, value2,value3,value4]|
|            [value5]|            [value6]|               [value5,value6]|
+--------------------+--------------------+------------------------------+

I tried the following:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def add_function(col1, col2):
    return col1 + col2

udf_add = udf(add_function, ArrayType(StringType()))
dftrial.withColumn("new", udf_add("col1", "col2")).show(2)

It does the task as desired. But I don't understand why, when I modify add_function to:

def add_function(col1, col2):
    return col1.extend(col2)

it returns null values. Why?

And my main question: is there another way to implement this task? Is there an already implemented function? I found concat, but it seems to work only on strings.

2 Answers


Why wouldn't it? Written as a type signature, list.extend is:

list.extend(iterable) -> None

So you get exactly what extend returns: None. If you wanted to return the modified collection, you should return col1 instead, but please don't, because there is an even worse problem here.
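The difference is easy to see in plain Python, outside of Spark (list.extend mutates its receiver and returns None, while + builds a new list):

```python
col1 = ["value1", "value2"]
col2 = ["value3", "value4"]

# extend mutates col1 in place and returns None;
# a UDF returning None is what Spark shows as null
result = col1.extend(col2)
print(result)    # None
print(col1)      # ['value1', 'value2', 'value3', 'value4'] -- col1 was changed

# + builds a NEW list and leaves both inputs untouched
a = ["value1", "value2"]
b = ["value3", "value4"]
print(a + b)     # ['value1', 'value2', 'value3', 'value4']
print(a)         # ['value1', 'value2'] -- unchanged
```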

You should never modify data in place when working with Spark. While you happen to be safe in this particular scenario, it can have unpredictable consequences. You can find a possible example in my answer to Will there be any scenario, where Spark RDD's fail to satisfy immutability?. While PySpark is relatively insulated from behaviors like this, that is only an implementation detail and not something you can depend on in general.


2 Comments

Thanks for your answer. Can you clarify what I should never do? I didn't understand it exactly, and it seems to be an important piece of information that I am missing.
list.extend modifies (mutates) the existing list. Don't do this with your data. Always return a new object unless mutation is explicitly allowed (see RDD.fold, RDD.aggregate, etc.).

I agree with @zero323. I just wanted to add the transformation that would be necessary to get the solution into a new dataframe.

  // requires: import spark.implicits._ (for the map encoder and .toDF)
  val updatedDataframe = initialDataframe.map {
    case Row(col1: Seq[String], col2: Seq[String]) => (col1, col2, col1.union(col2))
  }.toDF("col1", "col2", "col3")

