
So my table looks something like this:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(2024,201,310)| 21
    a     | NY  | b        |(2010,304,312)| 76
    c     | NY  | x        |(2010,304,310)| 11
    a     | NY  | b        |(453,131,235) | 10

I've tried the following, but it does not eliminate the duplicates, since the original array is still there (as it should be; I need it for the end result).

import org.apache.spark.sql.functions._

// pack the last element of `item`, the array itself and the count into a struct,
// then take the max struct per group
val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
      .groupBy(col("customer_1"), col("place"), col("customer_2"))
      .agg(max("vs").alias("vs"))
      .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I would like to group by the customer_1, place and customer_2 columns and return, per group, only the rows whose item array has a unique last element (-1), keeping the one with the highest count. Any ideas?

Expected output:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,312)| 76
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(453,131,235) | 10
    c     | NY  | x        |(2010,304,310)| 11

1 Answer


Given a dataframe with the following schema:

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- count: string (nullable = true)
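
A minimal sketch for building a dataframe with this schema from the sample rows in the question (the SparkSession name spark and the spark-shell context are assumptions here):

import spark.implicits._

val df = Seq(
  ("a", "NY", "b", Seq(2010, 304, 310), "34"),
  ("a", "NY", "b", Seq(2024, 201, 310), "21"),
  ("a", "NY", "b", Seq(2010, 304, 312), "76"),
  ("c", "NY", "x", Seq(2010, 304, 310), "11"),
  ("a", "NY", "b", Seq(453, 131, 235), "10")
).toDF("customer_1", "place", "customer_2", "item", "count")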

You can apply the concat function to create a temp column for checking duplicate rows, as done below:

import org.apache.spark.sql.functions._

// the key is the grouping columns plus the last element of the item array;
// rows that duplicate that key are dropped
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
    .dropDuplicates("temp")
    .drop("temp")

You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+
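
Note that dropDuplicates keeps an arbitrary row for each duplicate key, so the row with the highest count is not guaranteed to survive. If that guarantee is needed, one option (a sketch, assuming count can safely be cast to an integer) is to rank the rows within each key using a window and keep the top one:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// partition by the grouping columns plus the last element of the item array,
// order by count descending, and keep only the first row per partition
val w = Window
  .partitionBy($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1))
  .orderBy($"count".cast("int").desc)

df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")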

Struct

Given a dataframe with the following schema:

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: integer (nullable = false)
 |-- count: string (nullable = true)

We can still do the same as above, with a slight change: the third item is taken from the struct directly:

import org.apache.spark.sql.functions._

// same idea, but the key uses the struct's third field
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
    .dropDuplicates("temp")
    .drop("temp")

Hope the answer is helpful
