I have a Spark DataFrame with two columns, each holding list values. I want to create a new column that concatenates the two columns, merging the lists inside them row by row. For example:

Column 1 has a row value - [A,B]

Column 2 has a row value - [C,D]

The output should be in a new column, i.e.

Column 3(newly created column) with row value - [A,B,C,D]

Note: the values in each column are stored as lists.

Please help me implement this with PySpark. Thanks.

2 Answers

4

We can use a UDF:

 >>> from pyspark.sql import functions as F
 >>> from pyspark.sql.types import *
 >>> udf1 = F.udf(lambda x,y : x+y,ArrayType(StringType()))
 >>> df = df.withColumn('col3',udf1('col1','col2'))
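
One caveat with `lambda x, y: x + y`: if either column is null in some row, Python receives `None` and `x + y` raises a `TypeError`. The sketch below shows a null-tolerant version of the per-row logic in plain Python. The name `concat_lists` and the choice to treat a null as an empty list are assumptions for illustration, not part of the answer above.

```python
from typing import List, Optional


def concat_lists(x: Optional[List[str]], y: Optional[List[str]]) -> List[str]:
    """Concatenate two lists, treating None (a Spark null) as an empty list."""
    return (x or []) + (y or [])


# Plain-Python check of the per-row logic
print(concat_lists(["A", "B"], ["C", "D"]))
print(concat_lists(["A", "B"], None))

# With PySpark available, this could be wrapped the same way as the lambda:
# udf1 = F.udf(concat_lists, ArrayType(StringType()))
```

Whether a null should become an empty list or make the whole result null depends on the data, so adjust the fallback accordingly.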

0

As a general rule, if you want to join more than two list columns, I suggest using chain from itertools:

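from itertools import chain
from pyspark.sql import functions as F, types as T
# list() materializes the chain iterator so the UDF returns a real list
concat_list_columns = F.udf(lambda *list_: list(chain(*list_)), T.ArrayType(T.StringType()))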
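
To make the per-row behaviour concrete, here is a plain-Python sketch of what the function inside the UDF computes. `chain` returns a lazy iterator rather than a list, so it is wrapped in `list()` to get the concrete list an array-typed result needs. The helper name `flatten_row` and the decision to skip `None` inputs are assumptions for illustration.

```python
from itertools import chain


def flatten_row(*lists):
    """Join any number of lists into one; None inputs are skipped (an assumption)."""
    return list(chain(*(lst for lst in lists if lst is not None)))


row = (["A", "B"], ["C", "D"], ["E"])

lazy = chain(*row)              # an itertools.chain object, not a list
print(isinstance(lazy, list))   # False
print(flatten_row(*row))        # one flat list with all five values
```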

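Because Python UDFs are heavy on memory and add serialization overhead between the JVM and the Python workers, a better solution is the built-in PySpark function concat, which accepts array columns from Spark 2.4 onward: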

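from pyspark.sql import functions as F
# pass column names (or Column objects); any number of array columns is accepted
df = df.withColumn('col3', F.concat('col1', 'col2'))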
