
I have a DataFrame with 3 columns:

| str1      | array_of_str1        | array_of_str2  |
+-----------+----------------------+----------------+
| John      | [Size, Color]        | [M, Black]     |
| Tom       | [Size, Color]        | [L, White]     |
| Matteo    | [Size, Color]        | [M, Red]       |

I want to add an array column that combines the 3 columns, with one struct per array element:

| str1      | array_of_str1        | array_of_str2  | concat_result                                 |
+-----------+----------------------+----------------+-----------------------------------------------+
| John      | [Size, Color]        | [M, Black]     | [[[John, Size , M], [John, Color, Black]]]    |
| Tom       | [Size, Color]        | [L, White]     | [[[Tom, Size , L], [Tom, Color, White]]]      |
| Matteo    | [Size, Color]        | [M, Red]       | [[[Matteo, Size , M], [Matteo, Color, Red]]]  |
  • Can you provide the schema of the DataFrame? Commented Jan 6, 2020 at 7:16
  • |-- AdditionalAttribute: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- str1: string (nullable = true)
    |    |    |-- array_of_str1: string (nullable = true)
    |    |    |-- array_of_str2: string (nullable = true)
    Commented Jan 6, 2020 at 9:31
  • @lazycoder, what is AdditionalAttribute, is this a field-name of the column concat_result? is your spark 2.4+ or below? Commented Jan 8, 2020 at 3:09
  • @jxc AdditionalAttribute is the array name. I'm using Spark 2.4.3 Commented Jan 8, 2020 at 8:40
  • @lazycoder, so AdditionalAttribute is your desired column name, not concat_result shown in your post? and the new column has a schema of array of structs with 3 string fields? Commented Jan 8, 2020 at 12:14

2 Answers


If the number of elements in the arrays is fixed, this is quite straightforward with the array and struct functions. Here is a bit of code in Scala.

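import org.apache.spark.sql.functions.{array, col, struct}

// One struct per index; (0 to 1) assumes each array holds exactly 2 elements.
val result = df
    .withColumn("concat_result", array((0 to 1).map(i => struct(
        col("str1"),
        col("array_of_str1").getItem(i),
        col("array_of_str2").getItem(i)
    )) : _*))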

And in Python, since you were asking about PySpark:

import pyspark.sql.functions as F

# One struct per index; range(2) assumes each array holds exactly 2 elements.
result = df.withColumn("concat_result", F.array(*[
    F.struct(
        F.col("str1"),
        F.col("array_of_str1").getItem(i),
        F.col("array_of_str2").getItem(i))
    for i in range(2)]))

And you get the following schema:

root
 |-- str1: string (nullable = true)
 |-- array_of_str1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- array_of_str2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- concat_result: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- str1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: string (nullable = true)
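To make the fixed-size caveat concrete, here is a minimal plain-Python model of what the code above computes for a single row. It is only a sketch, not Spark code: `fixed_index_structs` is a hypothetical helper, `None` stands in for a Spark null, and the "Ann" row is made-up data. It assumes `getItem(i)` yields null for an out-of-range index, which is my understanding of Spark 2.4 with default settings (ANSI mode in later versions may raise an error instead).

```python
def fixed_index_structs(str1, names, values, n=2):
    """Plain-Python model of one row; None stands in for a Spark null."""
    def item(seq, i):
        # Mirrors the assumed getItem behaviour: out-of-range index gives null.
        return seq[i] if i < len(seq) else None

    # One (str1, name, value) tuple per index in the fixed range.
    return [(str1, item(names, i), item(values, i)) for i in range(n)]


john = fixed_index_structs("John", ["Size", "Color"], ["M", "Black"])
print(john)  # [('John', 'Size', 'M'), ('John', 'Color', 'Black')]

# Hypothetical row with shorter arrays: the second struct is padded with None.
ann = fixed_index_structs("Ann", ["Size"], ["S"])
print(ann)  # [('Ann', 'Size', 'S'), ('Ann', None, None)]
```

So with a hard-coded range, shorter arrays produce null-filled structs and longer arrays are silently truncated, which is why this approach only fits arrays of a known, fixed length.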

1 Comment

What if the size of array_of_str1 and array_of_str2 is variable? How can I specify the range dynamically here?

Spark >= 2.4.x

For arrays of variable length, you can use higher-order functions:

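import pyspark.sql.functions as f

# arrays_zip pairs the two arrays element by element;
# TRANSFORM then builds one struct per pair, repeating str1 in each.
expr = "TRANSFORM(arrays_zip(array_of_str1, array_of_str2), x -> struct(str1, concat(x.array_of_str1), concat(x.array_of_str2)))"
df = df.withColumn('concat_result', f.expr(expr))

df.show(truncate=False)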

Schema and output:

root
 |-- array_of_str1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- array_of_str2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str1: string (nullable = true)
 |-- concat_result: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- str1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: string (nullable = true)

+-------------+-------------+------+-----------------------------------------+
|array_of_str1|array_of_str2|str1  |concat_result                            |
+-------------+-------------+------+-----------------------------------------+
|[Size, Color]|[M, Black]   |John  |[[John, Size, M], [John, Color, Black]]  |
|[Size, Color]|[L, White]   |Tom   |[[Tom, Size, L], [Tom, Color, White]]    |
|[Size, Color]|[M, Red]     |Matteo|[[Matteo, Size, M], [Matteo, Color, Red]]|
+-------------+-------------+------+-----------------------------------------+
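If you have more than two array columns, the same expression can be generated instead of typed by hand. The following is only a sketch: `build_zip_expr` is a hypothetical helper (not part of PySpark), and it swaps `struct(...)` for `named_struct(...)` so the struct fields get explicit names rather than auto-generated ones like col2 and col3. It only builds the SQL string; I have not checked the result against every Spark version.

```python
def build_zip_expr(scalar_col, array_cols):
    """Build a Spark SQL TRANSFORM/arrays_zip expression string (hypothetical helper)."""
    if not array_cols:
        raise ValueError("need at least one array column")
    zipped = ", ".join(array_cols)
    # named_struct takes alternating 'name', value arguments.
    fields = [f"'{scalar_col}', {scalar_col}"]
    fields += [f"'{c}', x.{c}" for c in array_cols]
    return f"TRANSFORM(arrays_zip({zipped}), x -> named_struct({', '.join(fields)}))"


expr = build_zip_expr("str1", ["array_of_str1", "array_of_str2"])
print(expr)

# With a DataFrame `df` shaped like the one in the question, it would be applied as:
# import pyspark.sql.functions as f
# df = df.withColumn("concat_result", f.expr(expr))
```

Note that arrays_zip pads the shorter array with nulls when the lengths differ within a row, so mismatched rows yield structs with null fields rather than an error.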
