
I have a dataframe with the following schema:

 |-- A: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- index: boolean (nullable = false)
 |-- idkey: string (nullable = true)

Since the value in the map is of type array, I need to extract the field index corresponding to the id in the "foreign" key field idkey.

For example, I have the following data:

 {"A":{
 "innerkey_1":[{"id":"1","type":"0.01","index":true},
               {"id":"6","type":"4.3","index":false}]},
 "idkey":"1"}

Since the idkey is 1, we need to output the value of index from the element where "id":"1", i.e. the index should be equal to true. I am really not sure how I can accomplish this, with UDFs or otherwise.

Expected output is:

+---------+
| indexout|
+---------+
|   true  |
+---------+
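To make the goal concrete outside Spark: in plain Scala collections, the lookup I want amounts to the following (a sketch with a hypothetical case class mirroring the struct, not the actual dataframe API):

```scala
// Mirrors the struct {id, type, index} inside the map's arrays.
// `type` is a Scala keyword, so it needs backticks as a field name.
case class Element(id: String, `type`: String, index: Boolean)

// The map column A, as plain Scala data.
val a: Map[String, Seq[Element]] = Map(
  "innerkey_1" -> Seq(
    Element("1", "0.01", index = true),
    Element("6", "4.3", index = false)
  )
)
val idkey = "1"

// Flatten all arrays in the map, find the element whose id matches
// idkey, and return its index field.
val indexout: Option[Boolean] =
  a.values.flatten.find(_.id == idkey).map(_.index)
```

Here `indexout` would be `Some(true)` for the sample data above.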
  • Can you clarify "i.e. the index should be equal to 0"? And can you share your expected output too? Commented Mar 13, 2018 at 5:35
  • And how can 1 be a boolean value? And the type field seems to be double, not string? Commented Mar 13, 2018 at 5:44
  • I have fixed the typos, thanks for pointing them out. Commented Mar 13, 2018 at 13:06
  • index false has id 6; they don't match idkey with id. The matching index should be true. Commented Mar 13, 2018 at 13:18
  • Aren't "Since the idkey is 1, we need to output the value of index corresponding to the element where "id":1" and "i.e. the index should be equal to false" contradicting each other? Commented Mar 13, 2018 at 13:59

1 Answer


If your dataframe has following schema

root
 |-- A: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- index: boolean (nullable = false)
 |-- idkey: string (nullable = true)

then you can use two explode functions, one for the map and one for the inner array, followed by a filter to keep the element whose id matches idkey, and finally select the index:

import org.apache.spark.sql.functions._

df.select(col("idkey"), explode(col("A")))               // one row per (key, value) map entry
  .select(col("idkey"), explode(col("value")).as("value")) // one row per struct in the array
  .filter(col("idkey") === col("value.id"))              // keep the element whose id matches idkey
  .select(col("value.index").as("indexout"))

You should get

+--------+
|indexout|
+--------+
|true    |
+--------+
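For intuition, the two explodes and the filter correspond to this plain-Scala pipeline over the same data (a sketch with a hypothetical case class, not the Spark API itself):

```scala
// Mirrors the struct inside the map's arrays; `type` is a Scala
// keyword, so the field name needs backticks.
case class Item(id: String, `type`: String, index: Boolean)

val a: Map[String, Seq[Item]] = Map(
  "innerkey_1" -> Seq(Item("1", "0.01", true), Item("6", "4.3", false))
)
val idkey = "1"

val indexout: Seq[Boolean] = a.toSeq          // explode(col("A")): one row per map entry
  .flatMap { case (_, items) => items }       // explode(col("value")): one row per struct
  .filter(_.id == idkey)                      // filter(col("idkey") === col("value.id"))
  .map(_.index)                               // select(col("value.index"))
```

With the sample data, `indexout` would be `Seq(true)`, matching the output above.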

Using udf function

You can do the above with a udf function, which avoids the two explodes and the filter: all of that work happens inside the udf itself. You can modify it according to your needs.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Structs inside the map's arrays arrive in the udf as Rows.
def indexoutUdf = udf((a: Map[String, Seq[Row]], idkey: String) => {
  a.values.flatten                                 // all structs from all map entries
    .filter(y => y.getAs[String]("id") == idkey)   // match on the id field by name
    .map(y => y.getAs[Boolean]("index"))           // pull out the index field
    .head
})
df.select(indexoutUdf(col("A"), col("idkey")).as("indexout")).show(false)

I hope the answer is helpful.


2 Comments

Is there a way to do it other than using explode? I considered it but it will be too expensive for large dataframes.
@PramodKumar, I have updated the answer :) I hope the answer is going to be upvoted and accepted this time ;)
