
So my table looks something like this:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(2024,201,310)| 21
    a     | NY  | b        |(2010,304,312)| 76
    c     | NY  | x        |(2010,304,310)| 11
    a     | NY  | b        |(453,131,235) | 10

I've tried the following, but it does not eliminate the duplicates, since the original array is still there (as it should be; I need it for the end result).

import org.apache.spark.sql.functions._

// pack the last element of `item`, the array itself and the count into a struct,
// then take the max struct per group
val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
      .groupBy(col("customer_1"), col("place"), col("customer_2"))
      .agg(max("vs").alias("vs"))
      .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I would like to group by the customer_1, place and customer_2 columns and return, per group, only the rows whose item array has a unique last element (-1), keeping the one with the highest count. Any ideas?

Expected output:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,312)| 76
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(453,131,235) | 10
    c     | NY  | x        |(2010,304,310)| 11

1 Answer


Given a dataframe with the following schema:

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- count: string (nullable = true)
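
A minimal sketch for building a dataframe with this schema from the sample rows in the question (the SparkSession name spark and the spark-shell context are assumptions here):

import spark.implicits._

val df = Seq(
  ("a", "NY", "b", Seq(2010, 304, 310), "34"),
  ("a", "NY", "b", Seq(2024, 201, 310), "21"),
  ("a", "NY", "b", Seq(2010, 304, 312), "76"),
  ("c", "NY", "x", Seq(2010, 304, 310), "11"),
  ("a", "NY", "b", Seq(453, 131, 235), "10")
).toDF("customer_1", "place", "customer_2", "item", "count")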

You can apply the concat function to create a temp column for checking duplicate rows, as done below:

import org.apache.spark.sql.functions._

// the key is the grouping columns plus the last element of the item array;
// rows that duplicate that key are dropped
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
    .dropDuplicates("temp")
    .drop("temp")

You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+
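
Note that dropDuplicates keeps an arbitrary row for each duplicate key, so the row with the highest count is not guaranteed to survive. If that guarantee is needed, one option (a sketch, assuming count can safely be cast to an integer) is to rank the rows within each key using a window and keep the top one:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// partition by the grouping columns plus the last element of the item array,
// order by count descending, and keep only the first row per partition
val w = Window
  .partitionBy($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1))
  .orderBy($"count".cast("int").desc)

df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")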

Struct

Given a dataframe with the following schema:

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: integer (nullable = false)
 |-- count: string (nullable = true)

We can still do the same as above, with a slight change: the third item is taken from the struct directly:

import org.apache.spark.sql.functions._

// same idea, but the key uses the struct's third field
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
    .dropDuplicates("temp")
    .drop("temp")

Hope the answer is helpful
