
Data schema:

root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)


|id|col1         |col2               |
|1 |["x","y","z"]|[123,"null","null"]|

From the above data I want to filter where "x" exists in col1 and get the corresponding value for "x" from col2. (The values of col1 and col2 are positionally aligned: if "x" is at index 2 in col1, its value is at index 2 in col2.)

Result (col1 and col2 should remain array type):

|id |col1 |col2 |
|1  |["x"]|[123]|

If "x" is not present in col1, then I need a result like:

|id |col1     |col2     |
|1  |["null"] |["null"] |

I tried:

val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))

  • Which Spark version are you using, and is the number of fields in the arrays fixed? Commented Nov 24, 2019 at 17:41

1 Answer


The trick is to transform your data from dumb string columns into a more usable data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests they should be), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.

To start, use trim and split to convert col1 and col2 to arrays:

scala> val df = Seq(
     |       ("1", """["x","y","z"]""","""[123,"null","null"]"""),
     |         ("2", """["a","y","z"]""","""[123,"null","null"]""")
     |     ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]

scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
                        .withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_array.show(false)
+---+---------+-----------------+
|id |col1     |col2             |
+---+---------+-----------------+
|1  |[x, y, z]|[123, null, null]|
|2  |[a, y, z]|[123, null, null]|
+---+---------+-----------------+


scala> df_array.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

From here, you should be able to achieve what you want using array_position to find the index of 'x' (if any) in col1 and retrieve the matching data from col2 (a rough sketch of that route is shown after the final output below). However, converting the two arrays into a map first makes it clearer what your code is doing:

scala> val df_map = df_array.select(
                        $"id", 
                        map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
                        )
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]

scala> df_map.show(false)
+---+--------------------------------+
|id |col_map                         |
+---+--------------------------------+
|1  |[x -> 123, y -> null, z -> null]|
|2  |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+

scala> val df_final = df_map.select(
                                $"id",
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(lit("x")))
                                .as("col1"),  
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(element_at($"col_map", "x")))
                                .as("col2")
                                )
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_final.show
+---+------+------+
| id|  col1|  col2|
+---+------+------+
|  1|   [x]| [123]|
|  2|[null]|[null]|
+---+------+------+

scala> df_final.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- col2: array (nullable = false)
 |    |-- element: string (containsNull = true)
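
As mentioned above, the same result can be reached without the map step, using array_position directly on df_array. The following is only an untested sketch of that alternative (not part of the original answer), assuming the same spark-shell session as above so the standard Spark SQL functions are already in scope:

// array_position is 1-based and returns 0 when "x" is not found.
// It returns a long, so cast it to int before passing it to element_at,
// which expects an integer index for arrays.
val pos = array_position($"col1", "x")

val df_alt = df_array.select(
  $"id",
  when(pos > 0, array(lit("x"))).otherwise(array(lit("null"))).as("col1"),
  when(pos > 0, array(element_at($"col2", pos.cast("int"))))
    .otherwise(array(lit("null"))).as("col2")
)

df_alt.show  // should match df_final above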