I have a Spark DataFrame that looks something like this:

+---+------+----+
| id|animal|talk|
+---+------+----+
|  1|   bat|done|
|  2| mouse|mone|
|  3| horse| gun|
|  4| horse|some|
+---+------+----+

I want to generate a new column, say merged, which would look something like this:

+---+-----------------------------------------------------------+
| id| merged columns                                            |
+---+-----------------------------------------------------------+
|  1| [{name: animal, value: bat}, {name: talk, value: done}]   |
|  2| [{name: animal, value: mouse}, {name: talk, value: mone}] |
|  3| [{name: animal, value: horse}, {name: talk, value: gun}]  |
|  4| [{name: animal, value: horse}, {name: talk, value: some}] |
+---+-----------------------------------------------------------+

Basically, I want to combine all the columns into an Array of case class merged(name: String, value: String).

Can anyone help me with how to do this in Scala? For simplicity I have used only two columns here, but a generic answer that works for N columns would greatly help.

1 Answer

Your expected output doesn't quite seem to match your stated requirement of producing a list of name-value structured objects. If I understand the requirement correctly, consider using foldLeft to convert each wanted column into a StructType column with name and value fields, and then group those columns into a single ArrayType column:

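import org.apache.spark.sql.functions._
// Needed for toDF and $"..."; assumes `spark` is the active SparkSession
// (spark-shell already has this import in scope)
import spark.implicits._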

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")

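// Every column except the key gets merged
val cols = df.columns.filter(_ != "id")

val resultDF = cols.
  foldLeft(df)( (accDF, c) =>
    // Replace each column in place with a struct of (name = column name, value = column value)
    accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
  ).
  // Collect the struct columns into a single array column
  select($"id", array(cols.map(col): _*).as("merged"))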

resultDF.show(false)
// +---+-----------------------------+
// |id |merged                       |
// +---+-----------------------------+
// |1  |[[animal,bat], [talk,done]]  |
// |2  |[[animal,mouse], [talk,mone]]|
// |3  |[[animal,horse], [talk,gun]] |
// |4  |[[animal,horse], [talk,some]]|
// +---+-----------------------------+

resultDF.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- merged: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- name: string (nullable = false)
//  |    |    |-- value: string (nullable = true)
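
Two caveats for the N-column case: array requires all of its elements to have the same type, so value columns of different types would need a common type, and the result above is an untyped DataFrame rather than an array of your case class. The following is only a minimal sketch of one way to handle both, assuming it is acceptable to cast every value to string. The names Merged, Record, MergeColumnsSketch and mergeColumns are made up for illustration, and it builds the structs in a single select instead of foldLeft:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

// Hypothetical case classes; they are defined at the top level so that
// Spark can derive encoders for them.
case class Merged(name: String, value: String)
case class Record(id: Int, merged: Seq[Merged])

object MergeColumnsSketch {

  // Builds an array of (name, value) structs from every column except idCol.
  // Assumption: casting each value to string is acceptable for your data.
  def mergeColumns(df: DataFrame, idCol: String): DataFrame = {
    val cols = df.columns.filter(_ != idCol)
    val structs = cols.map { c =>
      struct(lit(c).as("name"), col(c).cast("string").as("value"))
    }
    df.select(col(idCol), array(structs: _*).as("merged"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "bat", "done"),
      (2, "mouse", "mone"),
      (3, "horse", "gun"),
      (4, "horse", "some")
    ).toDF("id", "animal", "talk")

    // Convert to a typed Dataset so each row carries a Seq[Merged]
    val typed = mergeColumns(df, "id").as[Record]
    typed.collect().foreach(println)

    spark.stop()
  }
}
```

The struct field names have to match the case class field names (name, value) for .as[Record] to resolve. If your value columns all share one type already, the cast can be dropped.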