
I have a Scala array of type Array[Map[String, String]] and I want to convert it into a Spark DataFrame.

Input: Array(Map("col1" -> "val1"), Map("col2" -> "val2", "col1" -> "val1"), Map("col3" -> "val3"))

Expected output (a Spark DataFrame):

col1  col2  col3
val1  NA    NA
val1  val2  NA
NA    NA    val3

What is the best way to do this?

1 Answer


First, extract the distinct keys:

val input = Seq(Map("col1" -> "val1"), Map("col2" -> "val2", "col1" -> "val1"), Map("col3" -> "val3"))
val keys = input.flatMap(_.keys.toSeq).distinct
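
For the example input this yields (a quick check of the intermediate value):

keys
// Seq(col1, col2, col3) -- these will double as the DataFrame's column names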

You will then need a method that fills in the missing keys on each Map with null, as follows:

def completeNonExistingValuesWithNull(obj: Map[String, String])(columnNames: String*): Map[String, String] = {
  val nonExistingKeys = columnNames.filterNot(obj.keys.toSeq.contains)
  obj ++ Map(
    nonExistingKeys.map { key =>
      key -> (null: String)
    }: _*
  )
}
// Define a function value so it can be used directly in map calls
// without having to pass the keys every time
val completeNonExistingValues: Map[String, String] => Map[String, String] = 
    obj => completeNonExistingValuesWithNull(obj)(keys: _*)
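
As a quick check, applying it to the third input map fills in the two missing keys with null (a small illustrative snippet, not part of the original steps):

completeNonExistingValues(Map("col3" -> "val3"))
// Map(col3 -> val3, col1 -> null, col2 -> null)  (key order may differ)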

One other thing you need is a way to convert each value sequence into a tuple in order to create rows for the DataFrame (a Seq would otherwise be treated as a single ArrayType column):

// Note: this pattern match is hardcoded for exactly three columns,
// which matches the three distinct keys in this example
def toProduct(seq: Seq[String]) = seq match {
  case Seq(a, b, c) => (a, b, c)
}

And it's done:

val completedKeyValues: Seq[Map[String, String]] =
  input.map(completeNonExistingValues)

// Read each map's values in the order of `keys` so the columns line up,
// instead of relying on the Map's iteration order
val objects = completedKeyValues.map(m => keys.map(m)).map(toProduct)

import spark.implicits._
objects.toDF(keys: _*)
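
Note that toProduct ties the solution to exactly three columns. If the number of distinct keys can vary, a more general route (just a sketch, assuming the same spark session, input and keys values as above) is to build Row objects with an explicit schema instead of tuples:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One nullable string column per distinct key
val schema = StructType(keys.map(k => StructField(k, StringType, nullable = true)))

// Look up each key per map, defaulting to null, so the values always follow the schema's column order
val rows = input.map(m => Row(keys.map(k => m.getOrElse(k, null)): _*))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show()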