Using Spark version 1.6.1, I need to fetch distinct values from a column and then perform some specific transformation on top of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the results back to the driver program. Currently I am performing this task as below; is there a better approach?

 import sqlContext.implicits._

 // ApplicationId holds the column name as a String
 preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

 preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
   val applicationId = x.getAs[String](ApplicationId)
   val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
   // DO SOME TASK PER applicationId
 })

 preProcessedData.unpersist()

4 Answers


Well, to obtain all the distinct values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.

For example:

import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all the distinct values. If you call show, you should see only {1, 3}
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. Here it is a simple function, but it can get as complicated as you need.
// The lambda needs an explicit parameter type so udf can infer the schema.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
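
If you want a readable column name on the result, you can alias the UDF output; the age_div_10 name below is just an illustrative choice:

val namedResultDF = distinctValuesDF.select(myTransformationUDF(col("age")).alias("age_div_10"))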


In PySpark, try this:

df.select('col_name').distinct().show()

1 Comment

This works for Scala too.
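
For reference, the equivalent Scala one-liner, assuming a DataFrame df with a column named col_name:

df.select("col_name").distinct().show()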

This answer demonstrates how to transform data with Spark native functions, which generally perform better than UDFs. It also demonstrates how to use dropDuplicates, which is more suitable than distinct for certain queries.

Suppose you have this DataFrame:

+-------+-------------+
|country|    continent|
+-------+-------------+
|  china|         asia|
| brazil|south america|
| france|       europe|
|  china|         asia|
+-------+-------------+

Here's how to take all the distinct countries and run a transformation:

import org.apache.spark.sql.functions.{col, concat, lit}

df
  .select("country")
  .distinct
  .withColumn("country", concat(col("country"), lit(" is fun!")))
  .show()
+--------------+
|       country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+

You can use dropDuplicates instead of distinct if you don't want to lose the continent information:

df
  .dropDuplicates("country")
  .withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
  .show(false)
+-------+-------------+------------------------------------+
|country|continent    |description                         |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe       |france is a country in europe       |
|china  |asia         |china is a country in asia          |
+-------+-------------+------------------------------------+

See the Spark documentation for more information about filtering DataFrames and about dropping duplicates.
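
For instance, filtering uses the same column expressions shown above; a small sketch against the example DataFrame, reusing the functions import from the first snippet:

// Keep only the rows for one continent
df.filter(col("continent") === "asia").show()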

Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
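
As a minimal sketch of that pattern (the function name withIsFun is just illustrative, and this assumes Spark 2.x, where transform is available on DataFrame):

import org.apache.spark.sql.DataFrame

// A custom transformation is just a function from DataFrame to DataFrame
def withIsFun()(df: DataFrame): DataFrame =
  df.withColumn("country", concat(col("country"), lit(" is fun!")))

// Chain it after distinct with transform
df.select("country").distinct.transform(withIsFun()).show()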

1 Comment

dropDuplicates lets you keep all the column information in the DataFrame while performing the distinct only on the column(s) specified in the dropDuplicates call.
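
If you need distinct combinations of several columns, dropDuplicates also accepts a sequence of column names; a small sketch using the example above:

df.dropDuplicates(Seq("country", "continent")).show()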
df = df.select("column1", "column2", ..., "columnN").distinct.[].collect()

In place of the empty brackets you can insert a call such as toJSON if you want the result in JSON format.
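
For instance, a sketch of the JSON variant in Scala (toJSON converts each row to a JSON string; the column names here are just placeholders):

val jsonRows = df.select("column1", "column2").distinct().toJSON.collect()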
