I have a column whose value is an array of objects.

Objects have the following structure:

[
  {
    "key": "param1",
    "val": "value1"
  },
  {
    "key": "param2",
    "val": "value2"
  },
  {
    "key": "param3",
    "val": "value3"
  }
]
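
+----------+------------------------------------------------------------------------------------------------+
|someColumn|colName                                                                                         |
+----------+------------------------------------------------------------------------------------------------+
|text      |[{key: "param1", val: "value1"}, {key: "param2", val: "value2"}, {key: "param3", val: "value3"}]|
+----------+------------------------------------------------------------------------------------------------+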

When I do:

df.withColumn("exploded", explode(col("colName")))

I get

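+----------+------------------------------+
|someColumn|exploded                      |
+----------+------------------------------+
|text      |{key: "param1", val: "value1"}|
|text      |{key: "param2", val: "value2"}|
|text      |{key: "param3", val: "value3"}|
+----------+------------------------------+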

Then I do:

df.select("*", "exploded.*").drop("exploded")

I get this:

+----------+------+------+
|someColumn|key   |val   |
+----------+------+------+
|text      |param1|value1|
|text      |param2|value2|
|text      |param3|value3|
+----------+------+------+

I understand why I get this result, but I need a different structure.

I want to get the following result:

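+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text      |value1|value2|value3|
+----------+------+------+------+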

Do I have to transform the array of Object[key, value] into a Map and then transform the Map into columns? What sequence of transformations do I need?

  • Try pivot to DataFrame? Commented Aug 1, 2022 at 9:53

2 Answers


Once you have exploded your dataset, you can pivot it:

df = df.groupBy("someColumn").pivot("exploded.key").agg(first("exploded.val"))

The statement above produces:

+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text      |value1|value2|value3|
+----------+------+------+------+

which is what you want!
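
For anyone who wants to try the explode + pivot route end to end, here is a minimal Scala sketch. It rests on some assumptions: the local SparkSession, the app name, and the way the sample DataFrame is built with spark.sql are only for illustration, and the names exploded and pivoted are mine. I also flatten the struct fields into plain key and val columns before pivoting, which is a small deviation from the one-liner above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, first}

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("pivot-sketch")
  .getOrCreate()

// Sample data shaped like the question: one row with an array of {key, val} structs.
val df = spark.sql(
  """SELECT 'text' AS someColumn,
    |       array(
    |         named_struct('key', 'param1', 'val', 'value1'),
    |         named_struct('key', 'param2', 'val', 'value2'),
    |         named_struct('key', 'param3', 'val', 'value3')
    |       ) AS colName""".stripMargin)

// One row per array element.
val exploded = df.withColumn("exploded", explode(col("colName")))

// Flatten the struct, then turn each distinct key into its own column.
val pivoted = exploded
  .select(col("someColumn"), col("exploded.key").as("key"), col("exploded.val").as("val"))
  .groupBy("someColumn")
  .pivot("key")
  .agg(first("val"))

pivoted.show(false)
```

If the keys are known up front, passing them explicitly, e.g. .pivot("key", Seq("param1", "param2", "param3")), spares Spark the extra pass it otherwise needs to find the distinct pivot values.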


2 Comments

Yeah, it works. What do you think about another solution that I came up with?
If the keys are fixed, then your solution is good. But you want to avoid .collect(), since it already triggers an action. For big data processing, I think you should go with pivot or a solution that does not involve actions.

I found one more solution:

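import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark

// Split the array of structs into parallel arrays of keys and values,
// then zip them back together into a single map column.
val mappedDF = df
  .select(
    $"*",
    col("colName").getField("key").as("keys"),
    col("colName").getField("val").as("values")
  )
  .drop("colName")
  .select(
    $"*",
    map_from_arrays($"keys", $"values").as("colName")
  )

// Collect the distinct keys to the driver (this triggers a Spark action).
val keysDF = mappedDF.select(explode(map_keys($"colName"))).distinct()
val keys = keysDF.collect().map(f => f.get(0))
// Turn each key into its own column, then drop the helper columns.
val keyCols = keys.map(f => col("colName").getItem(f).as(f.toString))
mappedDF.select(col("*") +: keyCols: _*).drop("colName", "keys", "values")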

This solution works faster than pivot, but I'm not sure it's the best solution.

BTW, if we know the list of keys and that list is fixed, this solution becomes even faster because we don't have to get the list of keys from the DF.
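
To make the fixed-keys case concrete, here is a rough sketch under a few assumptions: knownKeys is a hypothetical hard-coded list, the local SparkSession and sample data are only for illustration, and I build the map with map_from_entries (Spark 2.4+) instead of the map_from_arrays step above, since the structs already have exactly two fields. No collect() is needed here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, map_from_entries}

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("fixed-keys-sketch")
  .getOrCreate()

// Sample data shaped like the question: one row with an array of {key, val} structs.
val df = spark.sql(
  """SELECT 'text' AS someColumn,
    |       array(
    |         named_struct('key', 'param1', 'val', 'value1'),
    |         named_struct('key', 'param2', 'val', 'value2'),
    |         named_struct('key', 'param3', 'val', 'value3')
    |       ) AS colName""".stripMargin)

// Assumption: the keys are known in advance, so nothing has to be collected.
val knownKeys = Seq("param1", "param2", "param3")

// map_from_entries treats the first struct field as the map key
// and the second as the map value.
val withMap = df.withColumn("asMap", map_from_entries(col("colName")))

// One output column per known key, looked up from the map.
val keyCols = knownKeys.map(k => col("asMap").getItem(k).as(k))
val result = withMap.select(col("someColumn") +: keyCols: _*)

result.show(false)
```

The whole thing stays a plain projection with no shuffle and no action until the result is actually used, which is presumably where the speedup over pivot comes from.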

I wrote more general code for the case where we need to group by several columns. In this post I simplified the example to make it easier to follow.
