0

I have a dataset in the RDD format, where each entry is an Array[Array[String]]. Each entry is an array of key/value pairs, and each entry may not contain all possible keys.

An example of a possible entry is [[K1, V1], [K2, V2], [K3, V3], [K5, V5], [K7, V7]] and another might be [[K1, V1], [K3, V3], [K21, V21]].

What I hope to achieve is to bring this RDD into a dataframe format. K1, K2, etc. always represent the same String over each of the rows (i.e. K1 is always "type" and K2 is always "color"), and I want to use these as the columns. The values V1, V2, etc. differ over rows, and I want to use these to populate the values for the columns.

I'm not sure how to achieve this, so I would appreciate any help/pointers.

1 Answer 1

1

You can do something like,

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import java.util.UUID
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

    val l1: Array[Array[String]] = Array(
      Array[String]("K1", "V1"),
      Array[String]("K2", "V2"),
      Array[String]("K3", "V3"),
      Array[String]("K5", "V5"),
      Array[String]("K7", "V7"))

    val l2: Array[Array[String]] = Array(
      Array[String]("K1", "V1"),
      Array[String]("K3", "V3"),
      Array[String]("K21", "V21"))

    val spark = SparkSession.builder().master("local").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Array(l1, l2)).flatMap(x => {
      val id = UUID.randomUUID().toString
      x.map(y => Row(id, y(0), y(1)))
    })

    val schema = new StructType()
      .add("id", "String")
      .add("key", "String")
      .add("value", "String")

    val df = spark
      .createDataFrame(rdd, schema)
      .groupBy("id")
      .pivot("key").agg(last("value"))
      .drop("id")

    df.printSchema()
    df.show(false)

The schema and output looks something like,

root
 |-- K1: string (nullable = true)
 |-- K2: string (nullable = true)
 |-- K21: string (nullable = true)
 |-- K3: string (nullable = true)
 |-- K5: string (nullable = true)
 |-- K7: string (nullable = true)

+---+----+----+---+----+----+
|K1 |K2  |K21 |K3 |K5  |K7  |
+---+----+----+---+----+----+
|V1 |null|V21 |V3 |null|null|
|V1 |V2  |null|V3 |V5  |V7  |
+---+----+----+---+----+----+

Note: this will produce null in missing places as shown in outputs. pivot basically transposes the data set based on some column Hope this answers your question!

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.