I have the dataset below:

{
  "col1": "val1",
  "col2": {
    "key1": "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
    "key2": "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"
  }
}

with schema StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StringType,true),true)).

I want to convert col2 to the format below:

{
  "col1": "val1",
  "col2": {
    "key1": {"SubCol1":"ABCD","SubCol2":"EFGH"},
    "key2": {"SubCol1":"IJKL","SubCol2":"MNOP"}
  }
}

The resulting dataset schema should be:

StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StructType(StructField(SubCol1,StringType,true), StructField(SubCol2,StringType,true)),true),true))

2 Answers

On Spark 3.0+, you can use transform_values on the map column:

val df2 = df.withColumn(
    "col2", 
    expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
)
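
For reference, here is a self-contained sketch of the same idea, assuming Spark 3.0+ and a SparkSession passed in by the caller. The object name TransformValuesSketch and the method run are illustrative and not part of the original answer; the sample row is the one from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object TransformValuesSketch {
  // Hypothetical helper: builds the question's sample row and parses
  // every map value from a JSON string into a struct.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(
      ("val1", Map(
        "key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
        "key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
    ).toDF("col1", "col2")

    // transform_values keeps the keys and applies the lambda to each value.
    // The struct definition is hard-coded to the fields shown in the question.
    df.withColumn(
      "col2",
      expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
    )
  }
}
```

Because the column is overwritten under the same name, the result keeps only col1 and col2, with col2 typed as a map from string to struct.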
1 Comment

I am using Spark 2.4.7, and transform_values is only available in version 3.0+. Is there any other alternative for transforming the map's values?

Try the code below. It works in Spark 2.4.7.

Creating DataFrame with sample data.

scala> val df = Seq(
("val1",Map(
            "key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
            "key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
).toDF("col1","col2")

df: org.apache.spark.sql.DataFrame = [col1: string, col2: map<string,string>]

Steps:

  1. Extract the map keys (map_keys) and values (map_values) into separate arrays.
  2. Convert each map value into the desired output type, i.e. a struct, using transform with from_json.
  3. Use the map_from_arrays function to combine the keys and the converted values from the steps above into a Map[String, Struct].

scala> val finalDF = df.withColumn(
    "col2_new",
    map_from_arrays(
        map_keys($"col2"),
        expr("""transform(map_values(col2), x -> from_json(x, "struct<SubCol1:string, SubCol2:string>"))""")
    )
)

Printing Schema

finalDF.printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- col2_new: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- SubCol1: string (nullable = true)
 |    |    |-- SubCol2: string (nullable = true)

Printing Final Output

scala> finalDF.show(false)

+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|col1|col2                                                                                      |col2_new                                    |
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|val1|[key1 -> {"SubCol1":"ABCD","SubCol2":"EFGH"}, key2 -> {"SubCol1":"IJKL","SubCol2":"MNOP"}]|[key1 -> [ABCD, EFGH], key2 -> [IJKL, MNOP]]|
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
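
If the goal is to replace col2 itself, as in the question's target schema, rather than add col2_new, one option is to write the result back under the same column name. This is a minimal sketch under that assumption; the object name ReplaceMapValuesSketch and the helper parseCol2 are hypothetical, and it uses only functions available in Spark 2.4 (map_keys, map_values, transform, map_from_arrays, from_json).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, expr, map_from_arrays, map_keys}

object ReplaceMapValuesSketch {
  // Hypothetical helper: overwrites col2 with a map whose values are structs.
  // The struct fields are hard-coded to those in the question's sample data.
  def parseCol2(df: DataFrame): DataFrame =
    df.withColumn(
      "col2",
      map_from_arrays(
        // Keys are kept as they are.
        map_keys(col("col2")),
        // Each JSON string value is parsed into a struct.
        expr("transform(map_values(col2), x -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
      )
    )
}
```

With the sample df above, ReplaceMapValuesSketch.parseCol2(df).select(col("col2")("key1")("SubCol1")) looks up a single parsed field, which is a quick way to confirm the values are structs and no longer strings.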
