0

Input data-frame:

{
  "F1" : "A",
  "F2" : "B",
  "F3" : [
            {
              "name" : "N1",
              "sf1" : "val_1",
              "sf2" : "val_2"
            },
            {
              "name" : "N2",
              "sf1" : "val_3",
              "sf2" : "val_4"
            }
         ],
  "F4" : {
        "SF1" : "val_5",
        "SF2" : "val_6",
        "SF3" : "val_7"
  }
}

Desired output:

[
  {
    "F1" : "A",
    "F2" : "B",

    "F3_name" : "N1",
    "F3_sf1" : "val_1",
    "F3_sf2" : "val_2",
    
    "F4_SF1" : "val_7",
    "F4_SF2" : "val_8",
    "F4_SF3" : "val_9",
  },
  {
    "F1" : "A",
    "F2" : "B",

    "F3_name" : "N2",
    "F3_sf1" : "val_3",
    "F3_sf2" : "val_4",
    
    "F4_SF1" : "val_7",
    "F4_SF2" : "val_8",
    "F4_SF3" : "val_9",
  }
]

F3 is an array of struct. The new data-frame is supposed to be flat and have converted this single row into one or more rows(2 in this example) based on the number of items in F3.

I am new to Spark & Scala. Any thought on how of achieve this transformation will be very helpful.

Thanks!

2 Answers 2

1

You could also first use an explode. Then, you could extract and rename the fields with a series of aliases (e.g., $"F3.name" as "F3_name"):

scala> case class NameSF(name: String, sf1: String, sf2: String)
defined class NameSF

scala> case class SF(SF1: String, SF2: String, SF3: String)
defined class SF

scala> case class F(F1: String, F2: String, F3: Array[NameSF], F4: SF)
defined class F

scala> val elem = F("A",
     |              "B",
     |              Array(NameSF("N1", "val_1", "val_2"), NameSF("N2", "val_3", "val_4")),
     |              SF("val_5", "val_6", "val_7"))
elem: F = F(A,B,[LNameSF;@2939bfa0,SF(val_5,val_6,val_7))

scala> val df = spark.createDataset(Seq(elem)).toDF
df: org.apache.spark.sql.DataFrame = [F1: string, F2: string ... 2 more fields]

scala> df.withColumn("F3", explode($"F3")).select($"F1",
     |                                            $"F2",
     |                                            $"F3.name" as "F3_name",
     |                                            $"F3.sf1" as "F3_sf1",
     |                                            $"F3.sf2" as "F3_sf2",
     |                                            $"F4.SF1" as "F4_SF1",
     |                                            $"F4.SF2" as "F4_SF2",
     |                                            $"F4.SF3" as "F4_SF3").show
+---+---+-------+------+------+------+------+------+                            
| F1| F2|F3_name|F3_sf1|F3_sf2|F4_SF1|F4_SF2|F4_SF3|
+---+---+-------+------+------+------+------+------+
|  A|  B|     N1| val_1| val_2| val_5| val_6| val_7|
|  A|  B|     N2| val_3| val_4| val_5| val_6| val_7|
+---+---+-------+------+------+------+------+------+
Sign up to request clarification or add additional context in comments.

Comments

1

You can use inline to explode and expand F3, and * to expand F4:

val df2 = df.selectExpr("F1","F2","inline(F3)","F4.*")

df2.show
+---+---+----+-----+-----+-----+-----+-----+
| F1| F2|name|  sf1|  sf2|  SF1|  SF2|  SF3|
+---+---+----+-----+-----+-----+-----+-----+
|  A|  B|  N1|val_1|val_2|val_5|val_6|val_7|
|  A|  B|  N2|val_3|val_4|val_5|val_6|val_7|
+---+---+----+-----+-----+-----+-----+-----+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.