Partition a Spark DataFrame based on column value?

Question

I have a DataFrame from a SQL source which looks like:

User(id: Long, fname: String, lname: String, country: String)

[1, Fname1, Lname1, Belarus]
[2, Fname2, Lname2, Belgium]
[3, Fname3, Lname3, Austria]
[4, Fname4, Lname4, Australia]

I want to partition this data and write it to CSV files, where each partition is based on the initial letter of the country, so Belarus and Belgium should end up in one output file, and Austria and Australia in another.

koiralo · Accepted Answer · 2017-07-10 16:57:02Z

12

Here is what you can do:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and $"..." outside spark-shell

// Create a DataFrame with demo data
val df = spark.sparkContext.parallelize(Seq(
  (1, "Fname1", "Lname1", "Belarus"),
  (2, "Fname2", "Lname2", "Belgium"),
  (3, "Fname3", "Lname3", "Austria"),
  (4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname", "lname", "country")

// Create a new column holding the first letter of country
val result = df.withColumn("countryFirst", split($"country", "")(0))

// Save the data, partitioned by the first letter of country
result.write.partitionBy("countryFirst").format("com.databricks.spark.csv").save("outputpath")

Edit: You can also use substring, which can improve performance, as suggested by Raphael Roth:

substring(Column str, int pos, int len): Substring starts at pos and is of length len when str is String type, or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

val result = df.withColumn("countryFirst", substring($"country", 1, 1))

and then use partitionBy("countryFirst") with write, as before.
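Putting the substring variant together end to end, a minimal sketch might look like the following. It assumes Spark 2.0 or later (where the built-in csv writer is available) and a local session. The object name, the helper `withFirstLetter`, and the `repartition` step are illustrative additions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, substring}

object PartitionByFirstLetter {
  // Hypothetical helper: adds a column `target` holding the first character
  // of the string column `source`. substring positions are 1-based.
  def withFirstLetter(df: DataFrame, source: String, target: String): DataFrame =
    df.withColumn(target, substring(col(source), 1, 1))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("partition-by-first-letter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1L, "Fname1", "Lname1", "Belarus"),
      (2L, "Fname2", "Lname2", "Belgium"),
      (3L, "Fname3", "Lname3", "Austria"),
      (4L, "Fname4", "Lname4", "Australia")
    ).toDF("id", "fname", "lname", "country")

    val result = withFirstLetter(df, "country", "countryFirst")

    result
      // Assumption: co-locate all rows for a letter in one task before
      // writing, so each letter's directory is written by a single task.
      .repartition(col("countryFirst"))
      .write
      .partitionBy("countryFirst")   // one directory per letter
      .option("header", "true")
      .mode("overwrite")
      .csv("outputpath")

    spark.stop()
  }
}
```

Without the `repartition` step, a letter's directory can contain several part files, one for each task that held rows for that letter. With it, each directory should normally end up with a single file.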

Hope this solves your problem!

edited Jul 10, 2017 at 16:57

answered Jul 7, 2017 at 7:34


4 Comments

jdk2588 Over a year ago

Adding to the question: does using df.withColumn carry a performance penalty, or could this be done more efficiently?

Raphael Roth Over a year ago

You could also use Spark's substring function instead of split; I think that's more readable.

user482963 Over a year ago

Can we do this with multiple columns?
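On the multiple-columns question: `partitionBy` takes a variable number of column names, and the output directories nest in the order the columns are given. A small sketch, assuming Spark 2.0+ and a DataFrame with a `country` column as in the question (the object and method names are hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, substring}

object MultiColumnPartition {
  // Hypothetical helper: partitions first by the country's initial letter and
  // then by the full country name, producing nested directories such as
  // <path>/countryFirst=B/country=Belarus/.
  def writeNested(df: DataFrame, path: String): Unit =
    df.withColumn("countryFirst", substring(col("country"), 1, 1))
      .write
      .partitionBy("countryFirst", "country")
      .option("header", "true")
      .mode("overwrite")
      .csv(path)
}
```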

Omkar Neogi Over a year ago

Doesn't this add an extra column called "countryFirst" to the output data? Is there a way to not have that column in the output data but still partition the data by the "countryFirst" column? A naive approach is to iterate over the distinct values of "countryFirst" and write the filtered data for each value. This way, you could avoid writing the extra column in the output. I was hoping to do better than this.
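On the extra-column concern: with `partitionBy`, Spark puts the partition value in the directory name (for example `countryFirst=B`) and leaves that column out of the data files themselves. One way to check this is to read a single partition directory directly and look at its columns. The sketch below assumes Spark 2.0+; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

object PartitionColumnCheck {
  // Writes demo rows partitioned by first letter under `basePath`, then
  // returns the column names seen when reading one partition directory directly.
  def columnsInPartitionDir(spark: SparkSession, basePath: String): Seq[String] = {
    import spark.implicits._

    Seq(
      (1L, "Fname1", "Lname1", "Belarus"),
      (2L, "Fname2", "Lname2", "Belgium")
    ).toDF("id", "fname", "lname", "country")
      .withColumn("countryFirst", substring(col("country"), 1, 1))
      .write
      .partitionBy("countryFirst")
      .option("header", "true")
      .mode("overwrite")
      .csv(basePath)

    // Reading the partition directory itself (not the base path) shows
    // only the columns physically stored in the CSV files.
    spark.read
      .option("header", "true")
      .csv(s"$basePath/countryFirst=B")
      .columns
      .toSeq
  }
}
```

Reading the base path instead would bring `countryFirst` back as a column, because Spark infers it from the directory names.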

Shaido · Accepted Answer · 2017-07-07 07:26:12Z

1

One alternative to solve this problem would be to first create a column containing only the first letter of each country. Having done this step, you could use partitionBy to save each partition to separate files.

dataFrame.write.partitionBy("column").format("com.databricks.spark.csv").save("/path/to/dir/")

answered Jul 7, 2017 at 7:26


2 Comments

jdk2588 Over a year ago

This will partition on the full column value, so Belarus and Belgium would end up in separate files rather than in one file.

Shaido Over a year ago

Yes, as I mentioned, you first need to create a separate column containing the country's first letter. Then use partitionBy on that column.
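The difference discussed in this exchange can be seen without Spark at all. The sketch below uses plain Scala collections as a stand-in for `partitionBy`, so it only illustrates the grouping, not how Spark writes files. The object and value names are made up for the example.

```scala
object GroupingDemo {
  val countries: Seq[String] = Seq("Belarus", "Belgium", "Austria", "Australia")

  // Grouping on the full value: one group per country (four groups).
  val byCountry: Map[String, Seq[String]] = countries.groupBy(identity)

  // Grouping on the derived first letter: Belarus and Belgium share "B",
  // Austria and Australia share "A" (two groups).
  val byFirstLetter: Map[String, Seq[String]] = countries.groupBy(_.take(1))

  def main(args: Array[String]): Unit = {
    println(s"groups by country: ${byCountry.size}")
    println(s"groups by first letter: ${byFirstLetter.size}")
  }
}
```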
