
I have json files of the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}

I want to read several such JSON files and deduplicate the rows based on the "name" field inside names. I tried

df.dropDuplicates(Array("names.name")) 

but it didn't do the magic.
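For reference, here is the intended result sketched with plain Scala collections (no Spark involved; `Person` is a hypothetical case class mirroring the sample JSON above), assuming the first occurrence of each name is kept:

```scala
// Sketch of the desired "distinct by name" semantics on plain Scala data.
// Person mirrors one entry of the "names" array in the sample JSON.
case class Person(name: String, lastName: String)

val names = Seq(
  Person("John", "Doe"),
  Person("John", "Marcus"),
  Person("David", "Luis")
)

// Deduplicate on "name" only, keeping the first occurrence of each name.
val distinctByName = names.distinctBy(_.name)
// Seq(Person("John", "Doe"), Person("David", "Luis"))
```

This is what `dropDuplicates` should produce once the nested `name` values are available as a top-level column.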


3 Answers


This seems to be a regression that was introduced in Spark 2.0. If you bring the nested column to the top level, you can drop the duplicates: create a new column from the columns you want to dedup on, drop duplicates on that column, and finally drop the helper column. The following works for composite keys as well.

val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")

5 Comments

Not sure why this would work, since the DEDUP_KEY column will contain the names separated by a comma, so .dropDuplicates("DEDUP_KEY") won't work correctly, will it?
No, the comma-delimited value is for when you have more than one key to dedup on (a composite key). In your case it will be an additional top-level column holding the name, so you can dedup on it. Did you give it a try?
Yes, I have. I looked at the results using .show() and it seems to create a DEDUP_KEY column with the names separated by a comma; dropDuplicates then doesn't work as expected.
Yes, I am sorry. You would first have to explode on names, then dedup. My apologies.
Thanks. I replied with what seems to be the solution using explode.

Just for future reference, the solution looks like:

val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", org.apache.spark.sql.functions.explode(new Column("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")
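To make the mechanics concrete, here is a plain-Scala sketch (no Spark) of what explode followed by dropDuplicates does: each input file contributes an array of names, explode flattens the arrays into one row per name, and dropDuplicates keeps one row per key. The `file1`/`file2` values are illustrative, not from the question.

```scala
// names.name extracted from two hypothetical input files.
val file1 = Seq("John", "John", "David")
val file2 = Seq("David", "Maria")

// explode: one row per element of the nested array.
val exploded = Seq(file1, file2).flatten

// dropDuplicates("DEDUP_NAME_KEY"): keep the first row per name.
val deduped = exploded.distinct
// Seq("John", "David", "Maria")
```

The `.cache()` in the answer above is an optimization so the exploded data isn't recomputed, not part of the deduplication logic.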



As an update to the existing answer, a similar thing can be achieved without explode. We can simply resolve each column name to a Column and concatenate the values to generate the DEDUPE_KEY.

val columns = Seq("names.name")
df.withColumn("DEDUPE_KEY", concat_ws("_", columns.map(att => col(att)):_*))
  .dropDuplicates("DEDUPE_KEY")
  .drop("DEDUPE_KEY")
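The composite-key idea can be sketched with plain Scala collections (no Spark; `Row` and the sample values are hypothetical): join the key fields with a separator and deduplicate on the joined string. Note the caveat from the comments above: when a key is a nested array like names.name, you still need to explode it first for per-name deduplication.

```scala
// Row mirrors a record with two key columns to dedup on.
case class Row(name: String, lastName: String)

val rows = Seq(Row("John", "Doe"), Row("John", "Doe"), Row("John", "Marcus"))

val deduped = rows
  .map(r => (Seq(r.name, r.lastName).mkString("_"), r)) // build DEDUPE_KEY
  .distinctBy(_._1)                                     // dropDuplicates("DEDUPE_KEY")
  .map(_._2)                                            // drop("DEDUPE_KEY")
// Seq(Row("John", "Doe"), Row("John", "Marcus"))
```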

