Spark: How to convert array of objects with fields key-value into columns with keys as names

Question

I have a column whose value is an array of objects.

Objects have the following structure:

[
  {
    "key": "param1",
    "val": "value1"
  },
  {
    "key": "param2",
    "val": "value2"
  },
  {
    "key": "param3",
    "val": "value3"
  }
]

someColumn	colName
text	`[{key: "param1", val: "value1"}, {key: "param2", val: "value2"}, {key: "param3", val: "value3"}]`

When I do:

df.withColumn("exploded", explode(col("colName")))

I get

someColumn	exploded
text	`{key: "param1", val: "value1"}`
text	`{key: "param2", val: "value2"}`
text	`{key: "param3", val: "value3"}`

Next, I run:

df.select("*", "exploded.*").drop("exploded")

I get this:

someColumn	key	val
text	param1	value1
text	param2	value2
text	param3	value3

I understand why I get this result, but I need a different structure.

I want the following result:

someColumn	param1	param2	param3
text	value1	value2	value3

Do I have to transform the array of {key, val} objects into a Map and then turn the Map into columns? What sequence of transformations do I need?

Try pivot to DataFrame?

– mvasyliv, Commented Aug 1, 2022 at 9:53

vilalabinot · Accepted Answer · 2022-08-01 14:25:46Z

2

Once you explode your dataset, you can:

df = df.groupBy("someColumn").pivot("exploded.key").agg(first("exploded.val"))

The statement above produces:

+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text      |value1|value2|value3|
+----------+------+------+------+

which is the result you want.
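For completeness, here is a minimal end-to-end sketch of the explode-then-pivot approach. The object name `PivotSketch`, the helper names, and the JSON literal used to build the sample row are illustrative assumptions, not part of the original question. The struct is flattened into plain `key` and `val` columns before pivoting, which is a choice made here for clarity.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, first}

object PivotSketch {
  // Builds the one-row sample from the question; the JSON literal is illustrative.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val json =
      """{"someColumn":"text","colName":[{"key":"param1","val":"value1"},{"key":"param2","val":"value2"},{"key":"param3","val":"value3"}]}"""
    spark.read.json(Seq(json).toDS())
  }

  // Explodes colName, flattens the struct, then pivots the keys into columns.
  // If `keys` is non-empty it is handed to pivot, so Spark does not have to
  // compute the distinct keys itself.
  def pivotParams(df: DataFrame, keys: Seq[String] = Seq.empty): DataFrame = {
    val flat = df
      .select(col("someColumn"), explode(col("colName")).as("exploded"))
      .select(col("someColumn"), col("exploded.key").as("key"), col("exploded.val").as("val"))
    val grouped = flat.groupBy("someColumn")
    val pivoted = if (keys.isEmpty) grouped.pivot("key") else grouped.pivot("key", keys)
    pivoted.agg(first("val"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
    pivotParams(sample(spark)).show(false)
    spark.stop()
  }
}
```

When the keys are known in advance, calling `pivotParams(df, Seq("param1", "param2", "param3"))` uses the `pivot` overload that takes explicit values. The Spark documentation describes that overload as the more efficient one, because otherwise Spark first has to compute the distinct pivot values internally.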


2 Comments

Alexander Lopatin Over a year ago

Yeah, it works. What do you think about another solution that I came up with?

vilalabinot Over a year ago

If the keys are fixed, then your solution is good. But you want to avoid .collect(), since it already triggers an action. For big data processing, I think you should go with pivot or another solution that does not involve actions.

Alexander Lopatin · Accepted Answer · 2022-08-01 14:42:26Z

I found one more solution:

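import org.apache.spark.sql.functions._
import spark.implicits._

// Split the array of structs into parallel arrays of keys and values,
// then zip them back together into a single map column.
val mappedDF = df
  .select(
    $"*",
    col("colName").getField("key").as("keys"),
    col("colName").getField("val").as("values")
  )
  .drop("colName")
  .select(
    $"*",
    map_from_arrays($"keys", $"values").as("colName")
  )

// Collect the distinct keys to the driver (this triggers an action).
val keysDF = mappedDF.select(explode(map_keys($"colName"))).distinct()
val keys = keysDF.collect().map(f => f.get(0))
// Build one column per key by looking it up in the map.
val keyCols = keys.map(f => col("colName").getItem(f).as(f.toString))
// Drop the helper columns so only someColumn and the key columns remain.
mappedDF.select(col("*") +: keyCols: _*).drop("colName", "keys", "values")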

This solution works faster than pivot, but I'm not sure it's the best one.

BTW, if we know the list of keys and it is fixed, this solution becomes even faster, because we don't have to fetch the list of keys from the DataFrame.

I wrote more general code for the case where we need to group by several columns. In this post I simplified the example to make it easier to understand.
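The fixed-keys case mentioned above can be sketched without any `collect()`. This is only a rough illustration under stated assumptions: the key list is passed in by the caller, the struct fields are ordered `key` then `val`, and keys are unique within each array. The names `FixedKeysSketch`, `toColumns`, and `paramMap` are hypothetical. It uses `map_from_entries` (available since Spark 2.4) rather than the `map_from_arrays` approach shown earlier.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, map_from_entries}

object FixedKeysSketch {
  // Turns the array of {key, val} structs into a map, then reads one column per known key.
  // No action is triggered here: the keys come from the caller, not from the data.
  def toColumns(df: DataFrame, keys: Seq[String]): DataFrame = {
    // map_from_entries treats the first struct field as the map key and the second as the value.
    val withMap = df.withColumn("paramMap", map_from_entries(col("colName")))
    val keyCols = keys.map(k => col("paramMap").getItem(k).as(k))
    withMap.select(col("someColumn") +: keyCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fixed-keys-sketch").getOrCreate()
    import spark.implicits._
    // Illustrative sample row matching the structure in the question.
    val json =
      """{"someColumn":"text","colName":[{"key":"param1","val":"value1"},{"key":"param2","val":"value2"},{"key":"param3","val":"value3"}]}"""
    val df = spark.read.json(Seq(json).toDS())
    toColumns(df, Seq("param1", "param2", "param3")).show(false)
    spark.stop()
  }
}
```

A key that is absent from a given row's array yields null in that column, since `getItem` on a map returns null for missing keys under Spark's default (non-ANSI) settings.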
