First, extract the distinct keys; these will become the column names:
val input = Seq(Map("col1" -> "val1"), Map("col2" -> "val2", "col1" -> "val1"), Map("col3" -> "val3"))
val keys = input.flatMap(_.keys).distinct // Seq("col1", "col2", "col3")
You will then need a method that fills in the missing keys on each Map with null, as follows:
def completeNonExistingValuesWithNull(obj: Map[String, String])(columnNames: String*): Map[String, String] = {
  // add every column name the map is missing, mapped to null
  val nonExistingKeys = columnNames.filterNot(obj.keySet.contains)
  obj ++ nonExistingKeys.map(_ -> (null: String))
}
// I would also create a function value to use in map calls,
// so the keys don't have to be passed on every call
val completeNonExistingValues: Map[String, String] => Map[String, String] =
obj => completeNonExistingValuesWithNull(obj)(keys: _*)
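For example, with the keys computed above, a map that only contains col1 gets the remaining columns filled with null:

completeNonExistingValues(Map("col1" -> "val1"))
// => Map(col1 -> val1, col2 -> null, col3 -> null)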
One other thing you need is a way to convert each sequence of values into a tuple, so that each value becomes its own column; passing a Seq directly would give you a single ArrayType column instead of one column per key:
// NOTE: this match is hardcoded to exactly three elements, matching the three
// distinct keys of this example; it throws a MatchError for any other arity
def toProduct(seq: Seq[String]): (String, String, String) = seq match {
  case Seq(a, b, c) => (a, b, c)
}
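As a side note, tuples cap out at 22 elements and this match must be updated whenever the column count changes. If the number of columns is not fixed, a rough alternative sketch (assuming a SparkSession named spark is in scope) is to build Row objects against an explicit schema, which has no arity limit:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// one nullable string column per key
val schema = StructType(keys.map(StructField(_, StringType, nullable = true)))

// Row.fromSeq accepts any number of values, so no tuple is needed
val rows = input.map(completeNonExistingValues).map(m => Row.fromSeq(keys.map(m)))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)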
Back to the tuple approach, and it's done:
val completedKeyValues: Seq[Map[String, String]] =
  input.map(completeNonExistingValues)
// read each map's values in the order of `keys`: a Map does not guarantee
// iteration order, so the values must be lined up with the column names explicitly
val objects = completedKeyValues.map(m => keys.map(m)).map(toProduct)
import spark.implicits._
objects.toDF(keys: _*)
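For the example input, the resulting DataFrame should look roughly like this (null rendering varies by Spark version):

objects.toDF(keys: _*).show()
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |val1|null|null|
// |val1|val2|null|
// |null|null|val3|
// +----+----+----+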