
I want to explode an Array[(Int, Int)] column.

INPUT:

colA newCol
1     [[11, 12],[13, 15]]
2     [[17, 91], [51, 72]]

OUTPUT:

colA newCol
1     11
1     13
2     17
2     51

My Schema looks like this:

 |-- colA: integer (nullable = true)
 |-- newCol: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- value: integer (nullable = true)
 |    |    |-- count: integer (nullable = true)

I've tried something like this:

val res = df.withColumn("tup", explode($"newCol")).select("colA", "tup")

res.select(col("colA"), col("tup")("value").as("uId"))
  • In your select statement just use tup.value to access the exploded column Commented Aug 1, 2019 at 9:56
  • @Aaron It fails: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2$mcII$sp is not a valid external type for schema of struct<value:int,count:int> Commented Aug 1, 2019 at 10:20

1 Answer


You can try something like this.

val result = df.withColumn("resultColumn", explode(col("newCol").getItem("value"))).select("colA", "resultColumn")

So you are basically extracting the value field from each struct in the array (which gives an array of integers) and then exploding that array into one row per element.
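If you prefer the approach from the question (explode first, then pull the field out of the struct), a minimal sketch of the equivalent, assuming the original colA/newCol column names:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Explode the array of structs into one row per struct,
// then select the struct's "value" field.
val res = df
  .withColumn("tup", explode(col("newCol")))
  .select(col("colA"), col("tup.value").as("resultColumn"))
```

Both versions produce the same rows; extracting the array of values before exploding just avoids carrying the unused count field through the intermediate result.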

Edited:

Here is how I created a DataFrame with the same schema.

scala> import spark.implicits._

scala> import org.apache.spark.sql.functions._

scala> val df = spark.sparkContext.parallelize(List(1, 2)).toDF("id")

scala> val df1 = df.withColumn("col2", array(struct(lit(1), lit(2)), struct(lit(3), lit(4))))

scala> df1.printSchema
root
 |-- id: integer (nullable = false)
 |-- col2: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: integer (nullable = false)
 |    |    |-- col2: integer (nullable = false)


scala> df1.withColumn("resultColumn",explode(col("col2").getItem("col1"))).select("id","resultColumn").show
+---+------------+
| id|resultColumn|
+---+------------+
|  1|           1|
|  1|           3|
|  2|           1|
|  2|           3|
+---+------------+

4 Comments

It still throws the same exception I quoted above. This is what I did: df.withColumn("resultColumn", explode(col("newCol").getItem("value"))).select("colA", "resultColumn").show()
How are you creating the dataframe? It looks like something is wrong with the way you are creating it (which is odd, because the schema looks fine to me). Let me edit my answer to add the DataFrame creation code.
This is how I generate the data frame:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
  Row(1, Array((1,2), (2,3), (3,4))),
  Row(2, Array((1,2), (2,3), (3,4)))
)
val schema = StructType(Array(
  StructField("colA", IntegerType),
  StructField("newCol", ArrayType(StructType(Array(
    StructField("value", IntegerType),
    StructField("count", IntegerType)))))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
@RajaSabarish It's failing due to an incompatibility between Scala tuples and Spark's struct data structure. Your DataFrame fails before running any transformation on it, as soon as you try to show the data (df.show). You might want to look at other ways (using case classes, or using toDF as I did) to create the DataFrame. Refer to this link: stackoverflow.com/questions/42805510/…
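As a sketch of the case-class route mentioned above (the Entry name is illustrative), letting Spark's encoders derive the struct schema instead of building Rows from raw tuples:

```scala
import spark.implicits._

// A case class maps cleanly to struct<value:int,count:int>,
// unlike a raw Scala tuple placed inside a Row.
case class Entry(value: Int, count: Int)

val df = Seq(
  (1, Array(Entry(11, 12), Entry(13, 15))),
  (2, Array(Entry(17, 91), Entry(51, 72)))
).toDF("colA", "newCol")
```

With a DataFrame built this way, df.show works and the explode-based approaches above apply unchanged.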
