
I am using Spark SQL (I mention Spark in case that affects the SQL syntax - I'm not familiar enough to be sure yet), and I have a table that I am trying to restructure. I'm getting stuck trying to transpose multiple columns at the same time.

Basically I have data that looks like:

userId    someString      varA     varB
   1      "example1"    [0,2,5]   [1,2,9]
   2      "example2"    [1,20,5]  [9,null,6]

and I'd like to explode both varA and varB simultaneously (the length will always be consistent) - so that the final output looks like this:

userId    someString      varA     varB
   1      "example1"       0         1
   1      "example1"       2         2
   1      "example1"       5         9
   2      "example2"       1         9
   2      "example2"      20       null
   2      "example2"       5         6

but I can only get a single explode(var) statement to work in one command, and if I try to chain them (i.e. create a temp table after the first explode command), I get a huge number of duplicate, unnecessary rows - as the sketch below shows.
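For reference, this is roughly what my chained attempt looks like - each explode multiplies the rows, so I end up with every combination of varA and varB (9 rows per user instead of 3):

import org.apache.spark.sql.functions.explode

// chaining explodes yields the cross product of the two arrays
df.withColumn("varA", explode($"varA"))
  .withColumn("varB", explode($"varB"))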

Many thanks!

3 Answers


Spark >= 2.4

You can skip the zip UDF and use the arrays_zip function:

df.withColumn("vars", explode(arrays_zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars.varA", $"vars.varB").show

Spark < 2.4

What you want is not possible without a custom UDF. In Scala you could do something like this:

val data = sc.parallelize(Seq(
    """{"userId": 1, "someString": "example1",
        "varA": [0, 2, 5], "varB": [1, 2, 9]}""",
    """{"userId": 2, "someString": "example2",
        "varA": [1, 20, 5], "varB": [9, null, 6]}"""
))

val df = spark.read.json(data)

df.printSchema
// root
//  |-- someString: string (nullable = true)
//  |-- userId: long (nullable = true)
//  |-- varA: array (nullable = true)
//  |    |-- element: long (containsNull = true)
//  |-- varB: array (nullable = true)
//  |    |-- element: long (containsNull = true)

Now we can define the zip UDF:

import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
   $"userId", $"someString",
   $"vars._1".alias("varA"), $"vars._2".alias("varB")).show

// +------+----------+----+----+
// |userId|someString|varA|varB|
// +------+----------+----+----+
// |     1|  example1|   0|   1|
// |     1|  example1|   2|   2|
// |     1|  example1|   5|   9|
// |     2|  example2|   1|   9|
// |     2|  example2|  20|null|
// |     2|  example2|   5|   6|
// +------+----------+----+----+
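As an alternative on Spark 2.1+ you can avoid the UDF altogether (my own sketch, not part of the original approach) by combining posexplode with indexing into the second array:

import org.apache.spark.sql.functions.posexplode

df.select($"userId", $"someString", $"varB", posexplode($"varA"))
  .select(
    $"userId", $"someString",
    $"col".alias("varA"),          // posexplode emits columns "pos" and "col"
    $"varB"($"pos").alias("varB")) // index varB with the position from varA
  .show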

With raw SQL:

sqlContext.udf.register("zip", (xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
df.registerTempTable("df")

sqlContext.sql(
  """SELECT userId, someString, explode(zip(varA, varB)) AS vars FROM df""")

Comments

Can this be applied to 3 columns which are of type sequence?
@AmitKumar Yeah, why not? You'll have to adjust the signature and body, but it is not hard.
I wonder if, in the newer Datasets API, you could just use map and zip the arrays together without creating the UDF, and whether it would be faster, scale better, or be optimised by the Catalyst execution engine. I'll try it when at a console.
@zero323 can you please help me write the above UDF in Java for more than three columns?
@SatishKaruturi, starting with Spark 2.4.0 you no longer need to write your own UDF. There is a new arrays_zip function which can be applied to multiple columns.

You could also try:

case class Input(
  userId: Integer,
  someString: String,
  varA: Array[Integer],
  varB: Array[Integer])

case class Result(
  userId: Integer,
  someString: String,
  varA: Integer,
  varB: Integer)

// build one Result per index, pairing varA(i) with varB(i)
def getResult(row: Input): Iterable[Result] = {
  val userId = row.userId
  val someString = row.someString
  val varA = row.varA
  val varB = row.varB
  for (i <- 0 until varA.size)
    yield Result(userId, someString, varA(i), varB(i))
}

val obj1 = Input(1, "string1", Array(0, 2, 5), Array(1, 2, 9))
val obj2 = Input(2, "string2", Array(1, 3, 6), Array(2, 3, 10))
val input_df = sc.parallelize(Seq(obj1, obj2)).toDS

val res = input_df.flatMap{ row => getResult(row) }
res.show
// +------+----------+----+----+
// |userId|someString|varA|varB|
// +------+----------+----+----+
// |     1|   string1|   0|   1|
// |     1|   string1|   2|   2|
// |     1|   string1|   5|   9|
// |     2|   string2|   1|   2|
// |     2|   string2|   3|   3|
// |     2|   string2|   6|  10|
// +------+----------+----+----+



This approach works even with more than three columns:

import org.apache.spark.sql.functions.{explode, udf}

case class Input(
  user_id: Integer,
  someString: String,
  varA: Array[Integer],
  varB: Array[Integer],
  varC: Array[String],
  varD: Array[String])

val obj1 = Input(1, "example1", Array(0, 2, 5), Array(1, 2, 9), Array("a", "b", "c"), Array("red", "green", "yellow"))
val obj2 = Input(2, "example2", Array(1, 20, 5), Array(9, null, 6), Array("d", "e", "f"), Array("white", "black", "cyan"))
val obj3 = Input(3, "example3", Array(10, 11, 12), Array(5, 8, 7), Array("g", "h", "i"), Array("blue", "pink", "brown"))

val input_df = sc.parallelize(Seq(obj1, obj2, obj3)).toDS
input_df.show()

// zip the four arrays index-wise into an array of tuples
// (note: Spark implicitly casts the integer arrays to string here)
val zip = udf((a: Seq[String], b: Seq[String], c: Seq[String], d: Seq[String]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i))))

val output_df = input_df
  .withColumn("vars", explode(zip($"varA", $"varB", $"varC", $"varD")))
  .select(
    $"user_id", $"someString",
    $"vars._1".alias("varA"), $"vars._2".alias("varB"),
    $"vars._3".alias("varC"), $"vars._4".alias("varD"))
output_df.show()
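If all of the arrays share a single element type, one possible generalization (my sketch, not part of the original answer) is a UDF that transposes an array of arrays, so it handles any number of columns without changing its signature:

import org.apache.spark.sql.functions.array

// transpose an array of arrays: element i of every input array becomes row i
val zipN = udf((arrays: Seq[Seq[String]]) => arrays.transpose)

input_df
  .withColumn("vars", explode(zipN(array($"varC", $"varD"))))
  .select($"user_id", $"vars"(0).alias("varC"), $"vars"(1).alias("varD"))
  .show()

Note this only works directly when the columns have the same element type (here the string columns varC and varD); mixed types would need casting first.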

