Is there any way to flatten an array column in a Scala DataFrame?

As far as I know, selecting individual fields such as filed.a works, but I don't want to specify each of them manually.

  df.printSchema()
 |-- client_version: string (nullable = true)
 |-- filed: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: string (nullable = true)
 |    |    |-- d: string (nullable = true)

Desired final df:

df.printSchema()
     |-- client_version: string (nullable = true)
     |-- filed_a: string (nullable = true)
     |-- filed_b: string (nullable = true)
     |-- filed_c: string (nullable = true)
     |-- filed_d: string (nullable = true)
  • Using df.select("filed.a") will not produce your desired schema. It will produce a column of arrays, not individual String values. Commented Dec 17, 2018 at 21:08
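
A minimal sketch of what the comment describes, assuming a local SparkSession and a reduced two-field version of the struct (the names `nestedDf` and `arrayOnly` are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

// Build an array-of-structs column named "filed" with fields a and b.
val nestedDf = Seq(("1.0", "a1", "b1"), ("1.0", "a2", "b2"))
  .toDF("client_version", "a", "b")
  .groupBy("client_version")
  .agg(collect_list(struct("a", "b")).as("filed"))

// Selecting a field through the array keeps the array: column `a` is array<string>.
val arrayOnly = nestedDf.select("filed.a")
arrayOnly.printSchema()
```

The result still has one row per original row, so an explode step is needed to get one row per array element.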

2 Answers

You can flatten your ArrayType column with explode and map the nested struct element names to the wanted top-level column names, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF and the $"..." syntax; already in scope in spark-shell

case class S(a: String, b: String, c: String, d: String)

val df = Seq(
  ("1.0", Seq(S("a1", "b1", "c1", "d1"))),
  ("2.0", Seq(S("a2", "b2", "c2", "d2"), S("a3", "b3", "c3", "d3")))
).toDF("client_version", "filed")

df.printSchema
// root
//  |-- client_version: string (nullable = true)
//  |-- filed: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- a: string (nullable = true)
//  |    |    |-- b: string (nullable = true)
//  |    |    |-- c: string (nullable = true)
//  |    |    |-- d: string (nullable = true)

val dfFlattened = df.withColumn("filed_element", explode($"filed"))

val structElements = dfFlattened.select($"filed_element.*").columns

val dfResult = dfFlattened.select(
  col("client_version") +: structElements.map(
    c => col(s"filed_element.$c").as(s"filed_$c")
  ): _*
)

dfResult.show
// +--------------+-------+-------+-------+-------+
// |client_version|filed_a|filed_b|filed_c|filed_d|
// +--------------+-------+-------+-------+-------+
// |           1.0|     a1|     b1|     c1|     d1|
// |           2.0|     a2|     b2|     c2|     d2|
// |           2.0|     a3|     b3|     c3|     d3|
// +--------------+-------+-------+-------+-------+

dfResult.printSchema
// root
//  |-- client_version: string (nullable = true)
//  |-- filed_a: string (nullable = true)
//  |-- filed_b: string (nullable = true)
//  |-- filed_c: string (nullable = true)
//  |-- filed_d: string (nullable = true)
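
The same steps can be wrapped in a reusable function. The following is only a sketch: `flattenArrayOfStructs` is a hypothetical helper, not part of Spark, and it reads the field names from the schema rather than from an intermediate select. It assumes the temporary column name does not clash with an existing column and that column names contain no dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, explode, struct}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper: explodes an array-of-structs column and promotes every
// struct field to a top-level column named "<arrayCol>_<field>".
def flattenArrayOfStructs(df: DataFrame, arrayCol: String): DataFrame = {
  // Take the struct field names from the schema instead of listing them by hand.
  val fieldNames = df.schema(arrayCol).dataType match {
    case ArrayType(st: StructType, _) => st.fieldNames.toSeq
    case other =>
      throw new IllegalArgumentException(s"Column $arrayCol is not an array of structs: $other")
  }
  val tmp = s"${arrayCol}_element" // temporary column, assumed not to exist already
  val otherCols = df.columns.filterNot(_ == arrayCol).map(col).toSeq
  val flatCols = fieldNames.map(f => col(s"$tmp.$f").as(s"${arrayCol}_$f"))
  // One output row per array element, then keep only the flat columns.
  df.withColumn(tmp, explode(col(arrayCol))).select(otherCols ++ flatCols: _*)
}

// Usage on a small sample; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("flatten-sketch").getOrCreate()
import spark.implicits._

val sample = Seq(("1.0", "a1", "b1"), ("2.0", "a2", "b2"), ("2.0", "a3", "b3"))
  .toDF("client_version", "a", "b")
  .groupBy("client_version")
  .agg(collect_list(struct("a", "b")).as("filed"))

val flatSample = flattenArrayOfStructs(sample, "filed")
flatSample.printSchema()
```

Note that explode drops rows whose array is null or empty; if those rows must be kept, explode_outer (available since Spark 2.2) can be used in its place.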

Use explode to turn each array element into its own row, then select with the * notation to promote the struct fields to top-level columns.

import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._

val df = Seq(
  ("1", "a", "a", "a"),
  ("1", "b", "b", "b"),
  ("2", "a", "a", "a"),
  ("2", "b", "b", "b"),
  ("2", "c", "c", "c"),
  ("3", "a", "a", "a")
).toDF("idx", "A", "B", "C")
  .groupBy("idx")
  .agg(collect_list(struct("A", "B", "C")).as("nested_col"))

df.printSchema()
// root
//  |-- idx: string (nullable = true)
//  |-- nested_col: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- A: string (nullable = true)
//  |    |    |-- B: string (nullable = true)
//  |    |    |-- C: string (nullable = true)

df.show
// +---+--------------------+
// |idx|          nested_col|
// +---+--------------------+
// |  3|         [[a, a, a]]|
// |  1|[[a, a, a], [b, b...|
// |  2|[[a, a, a], [b, b...|
// +---+--------------------+

val dfExploded = df.withColumn("exploded", explode($"nested_col")).drop("nested_col")

dfExploded.show
// +---+---------+
// |idx| exploded|
// +---+---------+
// |  3|[a, a, a]|
// |  1|[a, a, a]|
// |  1|[b, b, b]|
// |  2|[a, a, a]|
// |  2|[b, b, b]|
// |  2|[c, c, c]|
// +---+---------+

val finalDF = dfExploded.select("idx", "exploded.*")

finalDF.show
// +---+---+---+---+
// |idx|  A|  B|  C|
// +---+---+---+---+
// |  3|  a|  a|  a|
// |  1|  a|  a|  a|
// |  1|  b|  b|  b|
// |  2|  a|  a|  a|
// |  2|  b|  b|  b|
// |  2|  c|  c|  c|
// +---+---+---+---+
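
As a possible shortcut, the explode and select steps can likely be combined with Spark SQL's inline generator, which expands an array of structs into rows and columns in one expression. This is a sketch rather than part of the answer above; the names `nested` and `inlined` are made up for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("inline-sketch").getOrCreate()
import spark.implicits._

// Same shape as the nested_col example: one array of structs per idx.
val nested = Seq(("1", "a", "a", "a"), ("1", "b", "b", "b"), ("2", "c", "c", "c"))
  .toDF("idx", "A", "B", "C")
  .groupBy("idx")
  .agg(collect_list(struct("A", "B", "C")).as("nested_col"))

// inline(...) emits one row per struct and one column per struct field.
val inlined = nested.selectExpr("idx", "inline(nested_col)")
inlined.printSchema()
```

Like explode, inline produces no rows for a null or empty array, so rows with such arrays disappear from the result.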
