
I want to explode an Array[(Int, Int)] column.

INPUT:

colA newCol
1     [[11, 12],[13, 15]]
2     [[17, 91], [51, 72]]

OUTPUT:

colA newCol
1     11
1     13
2     17
2     51

My Schema looks like this:

 |-- colA: integer (nullable = true)
 |-- newCol: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- value: integer (nullable = true)
 |    |    |-- count: integer (nullable = true)

I've tried something like this:

val res = df.withColumn("tup", explode($"newCol")).select("colA", "tup")

res.select(col("colA"), col("tup")("value").as("uId"))
  • In your select statement just use tup.value to access the exploded column Commented Aug 1, 2019 at 9:56
  • @Aaron It fails: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2$mcII$sp is not a valid external type for schema of struct<value:int,count:int> Commented Aug 1, 2019 at 10:20

1 Answer


You can try something like this.

val result = df.withColumn("resultColumn", explode(col("newCol").getItem("value"))).select("colA", "resultColumn")

So you are basically extracting the value field from each struct in the array (which gives an array of integers) and then exploding that array into one row per element.
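If you prefer the approach from the question (explode first, then pull the field out of the struct), a minimal sketch of the equivalent, assuming the original colA/newCol column names:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Explode the array of structs into one row per struct,
// then select the struct's "value" field.
val res = df
  .withColumn("tup", explode(col("newCol")))
  .select(col("colA"), col("tup.value").as("resultColumn"))
```

Both versions produce the same rows; extracting the array of values before exploding just avoids carrying the unused count field through the intermediate result.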

Edited:

Here is how I created a DataFrame with the same schema.

scala> import spark.implicits._

scala> import org.apache.spark.sql.functions._

scala> val df = spark.sparkContext.parallelize(List(1, 2)).toDF("id")

scala> val df1 = df.withColumn("col2", array(struct(lit(1), lit(2)), struct(lit(3), lit(4))))

scala> df1.printSchema
root
 |-- id: integer (nullable = false)
 |-- col2: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: integer (nullable = false)
 |    |    |-- col2: integer (nullable = false)


scala> df1.withColumn("resultColumn",explode(col("col2").getItem("col1"))).select("id","resultColumn").show
+---+------------+
| id|resultColumn|
+---+------------+
|  1|           1|
|  1|           3|
|  2|           1|
|  2|           3|
+---+------------+

4 Comments

It still throws the same exception I quoted above. This is what I did: df.withColumn("resultColumn", explode(col("newCol").getItem("value"))).select("colA", "resultColumn").show()
How are you creating the dataframe? It looks like something is wrong with the way you are creating it (which is odd, because the schema looks fine to me). Let me edit my answer to add the DataFrame creation code.
This is how I generate the data frame:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
  Row(1, Array((1,2), (2,3), (3,4))),
  Row(2, Array((1,2), (2,3), (3,4)))
)
val schema = StructType(Array(
  StructField("colA", IntegerType),
  StructField("newCol", ArrayType(StructType(Array(
    StructField("value", IntegerType),
    StructField("count", IntegerType)))))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
@RajaSabarish It's failing due to an incompatibility between Scala tuples and Spark's struct data structure. Your DataFrame fails before running any transformation on it, as soon as you try to show the data (df.show). You might want to look at other ways (using case classes, or using toDF as I did) to create the DataFrame. Refer to this link: stackoverflow.com/questions/42805510/…
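As a sketch of the case-class route mentioned above (the Entry name is illustrative), letting Spark's encoders derive the struct schema instead of building Rows from raw tuples:

```scala
import spark.implicits._

// A case class maps cleanly to struct<value:int,count:int>,
// unlike a raw Scala tuple placed inside a Row.
case class Entry(value: Int, count: Int)

val df = Seq(
  (1, Array(Entry(11, 12), Entry(13, 15))),
  (2, Array(Entry(17, 91), Entry(51, 72)))
).toDF("colA", "newCol")
```

With a DataFrame built this way, df.show works and the explode-based approaches above apply unchanged.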
