
I have json files of the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}

I want to read several such JSON files and deduplicate the rows based on the "name" field inside names. I tried

df.dropDuplicates(Array("names.name")) 

but it didn't do the magic.
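For reference, here is the intended result sketched with plain Scala collections (no Spark involved; `Person` is a hypothetical case class mirroring the sample JSON above), assuming the first occurrence of each name is kept:

```scala
// Sketch of the desired "distinct by name" semantics on plain Scala data.
// Person mirrors one entry of the "names" array in the sample JSON.
case class Person(name: String, lastName: String)

val names = Seq(
  Person("John", "Doe"),
  Person("John", "Marcus"),
  Person("David", "Luis")
)

// Deduplicate on "name" only, keeping the first occurrence of each name.
val distinctByName = names.distinctBy(_.name)
// Seq(Person("John", "Doe"), Person("David", "Luis"))
```

This is what `dropDuplicates` should produce once the nested `name` values are available as a top-level column.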


3 Answers


This seems to be a regression that was introduced in Spark 2.0. If you bring the nested column to the top level, you can drop the duplicates: create a new column from the columns you want to dedup on, drop duplicates on that column, and finally drop the helper column. The following works for composite keys as well.

val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")

5 Comments

Not sure why this would work, since the DEDUP_KEY column will contain the names separated by a comma, so .dropDuplicates("DEDUP_KEY") won't work correctly, will it?
No, the comma-delimited value is for when you have more than one key to dedup on (a composite key). In your case it will be an additional top-level column holding the name, so you can dedup on it. Did you give it a try?
Yes, I have. I looked at the results using .show() and it seems to create a DEDUP_KEY column with the names separated by a comma; dropDuplicates then doesn't work as expected.
Yes, I am sorry. You would first have to explode on names, then dedup. My apologies.
Thanks. I replied with what seems to be the solution using explode.

Just for future reference, the solution looks like:

val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", org.apache.spark.sql.functions.explode(new Column("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")
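To make the mechanics concrete, here is a plain-Scala sketch (no Spark) of what explode followed by dropDuplicates does: each input file contributes an array of names, explode flattens the arrays into one row per name, and dropDuplicates keeps one row per key. The `file1`/`file2` values are illustrative, not from the question.

```scala
// names.name extracted from two hypothetical input files.
val file1 = Seq("John", "John", "David")
val file2 = Seq("David", "Maria")

// explode: one row per element of the nested array.
val exploded = Seq(file1, file2).flatten

// dropDuplicates("DEDUP_NAME_KEY"): keep the first row per name.
val deduped = exploded.distinct
// Seq("John", "David", "Maria")
```

The `.cache()` in the answer above is an optimization so the exploded data isn't recomputed, not part of the deduplication logic.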



As an update to the existing answer, a similar thing can be achieved without explode. We can simply resolve each column name to a Column and concatenate the values to generate the DEDUPE_KEY.

val columns = Seq("names.name")
df.withColumn("DEDUPE_KEY", concat_ws("_", columns.map(att => col(att)):_*))
  .dropDuplicates("DEDUPE_KEY")
  .drop("DEDUPE_KEY")
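The composite-key idea can be sketched with plain Scala collections (no Spark; `Row` and the sample values are hypothetical): join the key fields with a separator and deduplicate on the joined string. Note the caveat from the comments above: when a key is a nested array like names.name, you still need to explode it first for per-name deduplication.

```scala
// Row mirrors a record with two key columns to dedup on.
case class Row(name: String, lastName: String)

val rows = Seq(Row("John", "Doe"), Row("John", "Doe"), Row("John", "Marcus"))

val deduped = rows
  .map(r => (Seq(r.name, r.lastName).mkString("_"), r)) // build DEDUPE_KEY
  .distinctBy(_._1)                                     // dropDuplicates("DEDUPE_KEY")
  .map(_._2)                                            // drop("DEDUPE_KEY")
// Seq(Row("John", "Doe"), Row("John", "Marcus"))
```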

