
I have two ArrayType(StringType()) columns in a Spark DataFrame, and I want to concatenate the two columns element-wise:

input:

+-------------+-------------+
|col1         |col2         |
+-------------+-------------+
|['a','b']    |['c','d']    |
|['a','b','c']|['e','f','g']|
+-------------+-------------+

output:

+-------------+-------------+----------------+
|col1         |col2         |col3            |
+-------------+-------------+----------------+
|['a','b']    |['c','d']    |['ac','bd']     |
|['a','b','c']|['e','f','g']|['ae','bf','cg']|
+-------------+-------------+----------------+

Thanks.


4 Answers


For Spark 2.4+, you can use the zip_with function:

zip_with(left, right, func) - Merges the two given arrays, element-wise, into a single array using function

df.withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))")).show()

#+------+------+--------+
#|  col1|  col2|    col3|
#+------+------+--------+
#|[a, b]|[c, d]|[ac, bd]|
#+------+------+--------+

Another way is to use the transform function, like this:

df.withColumn("col3", expr("transform(col1, (x, i) -> concat(x, col2[i]))"))

The transform function takes the first array column col1 as a parameter, iterates over its elements, and applies the lambda function (x, i) -> concat(x, col2[i]), where x is the current element and i is its index, used to fetch the corresponding element from array col2.
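For completeness, here is a minimal end-to-end sketch (assuming Spark 2.4+ and an existing SparkSession named spark) that builds the question's DataFrame and applies both approaches:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(['a', 'b'], ['c', 'd']), (['a', 'b', 'c'], ['e', 'f', 'g'])],
    ['col1', 'col2'])

# zip_with merges the two arrays pairwise; transform walks col1 by index.
result = (df
    .withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))"))
    .withColumn("col3_alt", expr("transform(col1, (x, i) -> concat(x, col2[i]))")))
result.show(truncate=False)

Both new columns come out identical here; zip_with reads more directly when the two arrays are guaranteed to have the same length.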


Here is an alternative answer that applies to the updated, non-original question. It uses array and array_except to demonstrate the use of such methods. The accepted answer is more elegant.

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Arbitrary max number of elements to apply array over; no need to broadcast
# such a small amount of data, afaik.
max_entries = 5

# Generate numeric sample data: 3 rows + 5 rows, each with 2 arrays whose
# length varies across rows but is constant within a row.
dfA = spark.createDataFrame(
    [([x, x+1, 4, x+100], 4, [x+100, x+200, 999, x+500]) for x in range(3)],
    ['array1', 'value1', 'array2']).withColumn("s", size(col("array1")))
dfB = spark.createDataFrame(
    [([x, x+1], 4, [x+100, x+200]) for x in range(5)],
    ['array1', 'value1', 'array2']).withColumn("s", size(col("array1")))
df = dfA.union(dfB)

# Concat the array elements, which vary in size per row, up to the max amount;
# out-of-range indices yield null.
df2 = df.select([concat(col("array1")[i], col("array2")[i]) for i in range(max_entries)])
df3 = df2.withColumn("res", array(df2.schema.names))

# Get results, but strip out null entries from the array.
df3.select(array_except(df3.res, array(lit(None)))).show(truncate=False)
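One caveat worth flagging: array_except returns its result without duplicates, so if the same concatenated value appears more than once in a row, it will be collapsed to a single entry.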

I could not get the value of column s passed into range.

This returns:

+------------------------------+
|array_except(res, array(NULL))|
+------------------------------+
|[0100, 1200, 4999, 100500]    |
|[1101, 2201, 4999, 101501]    |
|[2102, 3202, 4999, 102502]    |
|[0100, 1200]                  |
|[1101, 2201]                  |
|[2102, 3202]                  |
|[3103, 4203]                  |
|[4104, 5204]                  |
+------------------------------+

1 Comment

@pault: can you see if I can pass in a value from column s, as opposed to using a variable? I tried expr, etc., to no avail.
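Not from the thread, but one hedged workaround: range needs a plain Python int, so a per-row column value can't drive it. A Python UDF can instead zip the two arrays row by row, respecting each row's own length:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Illustrative helper (my naming): element-wise concat per row,
# so no fixed max_entries and no null-stripping are needed.
@udf(returnType=ArrayType(StringType()))
def zip_concat(a, b):
    return [str(x) + str(y) for x, y in zip(a, b)]

df.withColumn("res", zip_concat("array1", "array2")).show(truncate=False)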

It wouldn't really scale, but you could take the 0th and 1st entries of each array and build col3 from a[0] + b[0] and a[1] + b[1]: compute all four entries as separate values, then output them combined.
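A rough sketch of what this describes, using the question's col1/col2 and assuming every row has at least two elements per array (my own illustration, not code from the thread):

from pyspark.sql.functions import concat, col

# Index each array position explicitly and concatenate pairwise.
df.select(
    "col1", "col2",
    concat(col("col1")[0], col("col2")[0]).alias("e0"),
    concat(col("col1")[1], col("col2")[1]).alias("e1")).show()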

2 Comments

Not really an answer as to how we define things.
The number of elements could be different between rows.

Here is a generic answer; just look at res for the result. It assumes two equally sized arrays, thus n elements in both.

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Generate numeric sample data: 3 rows, each with 2 arrays of the same
# length, as in your example.
df = spark.createDataFrame(
    [([x, x+1, 4, x+100], 4, [x+100, x+200, 999, x+500]) for x in range(3)],
    ['array1', 'value1', 'array2'])
num_array_elements = len(df.select("array1").first()[0])

# Concatenate element-wise across the fixed number of positions.
df2 = df.select([concat(col("array1")[i], col("array2")[i]) for i in range(num_array_elements)])
df2.withColumn("res", array(df2.schema.names)).show(truncate=False)

returns:

[Screenshot of the show() output; the res column holds [0100, 1200, 4999, 100500], [1101, 2201, 4999, 101501], [2102, 3202, 4999, 102502].]

1 Comment

The number of elements in the arrays could be different between rows.
