
I have two ArrayType(StringType()) columns in a Spark DataFrame, and I want to concatenate the two columns element-wise:

input:

+-------------+-------------+
|col1         |col2         |
+-------------+-------------+
|['a','b']    |['c','d']    |
|['a','b','c']|['e','f','g']|
+-------------+-------------+

output:

+-------------+-------------+----------------+
|col1         |col2         |col3            |
+-------------+-------------+----------------+
|['a','b']    |['c','d']    |['ac','bd']     |
|['a','b','c']|['e','f','g']|['ae','bf','cg']|
+-------------+-------------+----------------+

Thanks.


4 Answers


For Spark 2.4+, you can use the zip_with function:

zip_with(left, right, func) - Merges the two given arrays, element-wise, into a single array using function

df.withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))")).show()

#+------+------+--------+
#|  col1|  col2|    col3|
#+------+------+--------+
#|[a, b]|[c, d]|[ac, bd]|
#+------+------+--------+

Another way is to use the transform function, like this:

df.withColumn("col3", expr("transform(col1, (x, i) -> concat(x, col2[i]))"))

The transform function takes the first array column col1 as a parameter, iterates over its elements, and applies the lambda function (x, i) -> concat(x, col2[i]), where x is the current element and i is its index, used to fetch the corresponding element from array col2.
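For completeness, here is a minimal end-to-end sketch (assuming Spark 2.4+ and an existing SparkSession named spark) that builds the question's DataFrame and applies both approaches:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(['a', 'b'], ['c', 'd']), (['a', 'b', 'c'], ['e', 'f', 'g'])],
    ['col1', 'col2'])

# zip_with merges the two arrays pairwise; transform walks col1 by index.
result = (df
    .withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))"))
    .withColumn("col3_alt", expr("transform(col1, (x, i) -> concat(x, col2[i]))")))
result.show(truncate=False)

Both new columns come out identical here; zip_with reads more directly when the two arrays are guaranteed to have the same length.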


Here is an alternative answer that applies to the updated, non-original question. It uses array and array_except to demonstrate the use of such methods. The accepted answer is more elegant.

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Arbitrary max number of elements to apply array over; no need to broadcast
# such a small amount of data, afaik.
max_entries = 5

# Generate numeric sample data: 3 rows + 5 rows, each with 2 arrays whose
# length varies across rows but is constant within a row.
dfA = spark.createDataFrame(
    [([x, x+1, 4, x+100], 4, [x+100, x+200, 999, x+500]) for x in range(3)],
    ['array1', 'value1', 'array2']).withColumn("s", size(col("array1")))
dfB = spark.createDataFrame(
    [([x, x+1], 4, [x+100, x+200]) for x in range(5)],
    ['array1', 'value1', 'array2']).withColumn("s", size(col("array1")))
df = dfA.union(dfB)

# Concat the array elements, which vary in size per row, up to the max amount;
# out-of-range indices yield null.
df2 = df.select([concat(col("array1")[i], col("array2")[i]) for i in range(max_entries)])
df3 = df2.withColumn("res", array(df2.schema.names))

# Get results, but strip out null entries from the array.
df3.select(array_except(df3.res, array(lit(None)))).show(truncate=False)
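One caveat worth flagging: array_except returns its result without duplicates, so if the same concatenated value appears more than once in a row, it will be collapsed to a single entry.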

I could not get the value of column s passed into range.

This returns:

+------------------------------+
|array_except(res, array(NULL))|
+------------------------------+
|[0100, 1200, 4999, 100500]    |
|[1101, 2201, 4999, 101501]    |
|[2102, 3202, 4999, 102502]    |
|[0100, 1200]                  |
|[1101, 2201]                  |
|[2102, 3202]                  |
|[3103, 4203]                  |
|[4104, 5204]                  |
+------------------------------+

1 Comment

@pault: can you see if I can pass in a value from column s, as opposed to using a variable? I tried expr, etc., to no avail.
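Not from the thread, but one hedged workaround: range needs a plain Python int, so a per-row column value can't drive it. A Python UDF can instead zip the two arrays row by row, respecting each row's own length:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Illustrative helper (my naming): element-wise concat per row,
# so no fixed max_entries and no null-stripping are needed.
@udf(returnType=ArrayType(StringType()))
def zip_concat(a, b):
    return [str(x) + str(y) for x, y in zip(a, b)]

df.withColumn("res", zip_concat("array1", "array2")).show(truncate=False)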

It wouldn't really scale, but you could take the 0th and 1st entries of each array and build col3 from a[0] + b[0] and a[1] + b[1]: compute all four entries as separate values, then output them combined.
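A rough sketch of what this describes, using the question's col1/col2 and assuming every row has at least two elements per array (my own illustration, not code from the thread):

from pyspark.sql.functions import concat, col

# Index each array position explicitly and concatenate pairwise.
df.select(
    "col1", "col2",
    concat(col("col1")[0], col("col2")[0]).alias("e0"),
    concat(col("col1")[1], col("col2")[1]).alias("e1")).show()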

2 Comments

Not really an answer as to how we define things.
The number of elements could be different between rows.

Here is a generic answer; just look at res for the result. It assumes two equally sized arrays, thus n elements in both.

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Generate numeric sample data: 3 rows, each with 2 arrays of the same
# length, as in your example.
df = spark.createDataFrame(
    [([x, x+1, 4, x+100], 4, [x+100, x+200, 999, x+500]) for x in range(3)],
    ['array1', 'value1', 'array2'])
num_array_elements = len(df.select("array1").first()[0])

# Concatenate element-wise across the fixed number of positions.
df2 = df.select([concat(col("array1")[i], col("array2")[i]) for i in range(num_array_elements)])
df2.withColumn("res", array(df2.schema.names)).show(truncate=False)

returns:

[Screenshot of the show() output; the res column holds [0100, 1200, 4999, 100500], [1101, 2201, 4999, 101501], [2102, 3202, 4999, 102502].]

1 Comment

The number of elements in the arrays could be different between rows.
