My downstream system does not support a Map type, but my source does and sends one. I need to convert this map into an array of structs (tuples).

Scala supports Map.toArray, which creates an array of tuples for you. That seems like the function I need to apply to the Map to get this transformation:

{
  "a" : {
    "b": {
      "key1" : "value1",
      "key2" : "value2"
    },
    "b_" : {
      "array": [
        {
          "key": "key1",
          "value" : "value1"
        },
        {
          "key": "key2",
          "value" : "value2"
        }
      ]
    }
  }
}
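For reference, here is a minimal plain-Scala sketch (no Spark involved) of what Map.toArray does with the data above. The object name MapToArrayDemo is made up for illustration only.

```scala
object MapToArrayDemo {
  // The map from a.b in the example above.
  val b: Map[String, String] = Map("key1" -> "value1", "key2" -> "value2")

  // Map.toArray yields one (key, value) tuple per map entry.
  val pairs: Array[(String, String)] = b.toArray

  def main(args: Array[String]): Unit =
    pairs.foreach { case (k, v) => println(s"key=$k value=$v") }
}
```

The tuple fields are only accessible as _1 and _2 here, which is why the field names need extra handling on the Spark side.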

What is the most efficient way to do this in Spark, given that the field to change is a nested one? For example:

a is the root level dataframe column

a.b is the map at level 1 (comes from the source)

a.b_ is the array of structs (this is what I want to generate by converting a.b to an array)

The answer so far goes some of the way, I think. I just can't get the suggested withColumn and UDF to generate the nested a.b_ field.

Thanks!

  • Can you update your question with some sample source data? And do you want help with dataframe or rdd? Commented May 14, 2017 at 16:41
  • Thanks, I just did this, and it is a DataFrame, not an RDD. I think the current UDF answer is close, but I can't quite get the nesting to work. It would also be good to be able to specify the types more generically for reuse, as we have string -> boolean, string -> string and string -> int maps. Hope you can help, thanks. Commented May 16, 2017 at 20:49
  • Hi Ramesh, is this enough info for you now? Thanks heaps. Commented May 17, 2017 at 5:41
  • @ramesh-maharjan Is the new info enough for you to help with this? Thanks! :) Commented May 18, 2017 at 5:52
  • I thought you already got the answer as you accepted an answer. Let me see what I can do about it. Give me some time. Commented May 18, 2017 at 6:00

1 Answer

Just use a UDF:

import org.apache.spark.sql.functions.udf
val toArray = udf((vs: Map[String, String]) => vs.toArray)

and adjust input type according to your needs.
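The follow-up requirements (fields named key and value, and the result placed inside the nested struct a) could be handled along the following lines. This is a sketch under assumptions, not a tested solution for the asker's data: the case class KeyValue, the object MapToArrayExample, the helper toEntries and the sample DataFrame are all invented for illustration, and b_ is produced as a plain array of structs without the extra "array" wrapper shown in the question's JSON.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, udf}

// Hypothetical case class: its field names become the struct field names,
// replacing the auto-generated _1 and _2 of a tuple.
case class KeyValue(key: String, value: String)

object MapToArrayExample {
  // Plain function so it can be checked without Spark; null-safe for missing maps.
  def toEntries(m: Map[String, String]): Seq[KeyValue] =
    if (m == null) null else m.toSeq.map { case (k, v) => KeyValue(k, v) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-to-array")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: root column a is a struct holding the map field b.
    val df = Seq(Tuple1(Map("key1" -> "value1", "key2" -> "value2")))
      .toDF("b")
      .select(struct(col("b")).as("a"))

    val toEntriesUdf = udf(toEntries _)

    // withColumn("a.b_", ...) would add a top-level column literally named "a.b_",
    // so the struct a is rebuilt with both the original map and the new array.
    val result = df.withColumn(
      "a",
      struct(col("a.b").as("b"), toEntriesUdf(col("a.b")).as("b_"))
    )

    result.printSchema()
    spark.stop()
  }
}
```

For other value types (string -> boolean, string -> int), a separate case class and UDF per value type would be one option. On Spark 2.4 or later, the built-in map_entries function returns an array of structs with fields key and value for any map type, which may remove the need for a UDF altogether.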

5 Comments

Great, thanks heaps for the reply. What if I need to rename the tuple key and value field names (from col_1 and col_2 to key and value)?
...also, if you check out [link] (dropbox.com/s/cbagegoiiomei9d/…), I am trying to set the availability.available_ field (using the available field above as the Map input), which is an array of struct type. I was trying val availability_DF = allProductsDF.select("*").withColumn("availability.available_", toArray($"availability.available")), but of course that didn't add it to the nested struct.
Any extra help on this would be greatly appreciated, thanks!
I updated the example DataFrame in the description. Can you expand on your UDF solution to accommodate the nested position? a.b is what I get as my input and a.b_ is what I need, but the elements are at a nested level in the tree, where a is the root. Refer to the example in the description. Thanks heaps!
Thanks for this. One additional requirement: if I need the key and value names in the resulting tuple to differ from the auto-generated _1 and _2 (e.g. [{"_1":"aKey","_2":"aValue"}] becoming [{"key":"aKey","value":"aValue"}]), how would you update the UDF to do so? Many thanks!
