
Data schema:

root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)


|id|col1         |col2               |
|1 |["x","y","z"]|[123,"null","null"]|

From the above data I want to filter where "x" exists in col1 and get the corresponding value for "x" from col2. (The values of col1 and col2 are positionally aligned: if "x" is at index 2 in col1, its value is at index 2 in col2.)

Result (col1 and col2 should remain array type):

|id |col1 |col2 |
|1  |["x"]|[123]|

If "x" is not present in col1, then I need a result like:

|id |col1     |col2     |
|1  |["null"] |["null"] |

I tried:

val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))

  • Which Spark version are you using, and is the number of fields in the arrays fixed? Commented Nov 24, 2019 at 17:41

1 Answer


The trick is to transform your data from dumb string columns into a more usable data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests they should be), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.

To start, use trim and split to convert col1 and col2 to arrays:

scala> val df = Seq(
     |       ("1", """["x","y","z"]""","""[123,"null","null"]"""),
     |         ("2", """["a","y","z"]""","""[123,"null","null"]""")
     |     ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]

scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
                        .withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_array.show(false)
+---+---------+-----------------+
|id |col1     |col2             |
+---+---------+-----------------+
|1  |[x, y, z]|[123, null, null]|
|2  |[a, y, z]|[123, null, null]|
+---+---------+-----------------+


scala> df_array.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)

From here, you should be able to achieve what you want using array_position to find the index of 'x' (if any) in col1 and retrieve the matching data from col2 (a rough sketch of that route is shown after the final output below). However, converting the two arrays into a map first makes it clearer what your code is doing:

scala> val df_map = df_array.select(
                        $"id", 
                        map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
                        )
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]

scala> df_map.show(false)
+---+--------------------------------+
|id |col_map                         |
+---+--------------------------------+
|1  |[x -> 123, y -> null, z -> null]|
|2  |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+

scala> val df_final = df_map.select(
                                $"id",
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(lit("x")))
                                .as("col1"),  
                                when(isnull(element_at($"col_map", "x")), 
                                    array(lit("null")))
                                .otherwise(
                                    array(element_at($"col_map", "x")))
                                .as("col2")
                                )
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]

scala> df_final.show
+---+------+------+
| id|  col1|  col2|
+---+------+------+
|  1|   [x]| [123]|
|  2|[null]|[null]|
+---+------+------+

scala> df_final.printSchema
root
 |-- id: string (nullable = true)
 |-- col1: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- col2: array (nullable = false)
 |    |-- element: string (containsNull = true)
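
As mentioned above, the same result can be reached without the map step, using array_position directly on df_array. The following is only an untested sketch of that alternative (not part of the original answer), assuming the same spark-shell session as above so the standard Spark SQL functions are already in scope:

// array_position is 1-based and returns 0 when "x" is not found.
// It returns a long, so cast it to int before passing it to element_at,
// which expects an integer index for arrays.
val pos = array_position($"col1", "x")

val df_alt = df_array.select(
  $"id",
  when(pos > 0, array(lit("x"))).otherwise(array(lit("null"))).as("col1"),
  when(pos > 0, array(element_at($"col2", pos.cast("int"))))
    .otherwise(array(lit("null"))).as("col2")
)

df_alt.show  // should match df_final above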