
I have a dataframe whose schema is below:

root
|-- school: string (nullable = true)
|-- questionName: string (nullable = true)
|-- difficultyValue: double (nullable = true)

The data is like this:

school   | questionName | difficultyValue
school1  | q1           | 0.32
school1  | q2           | 0.13
school1  | q3           | 0.58
school1  | q4           | 0.67
school1  | q5           | 0.59
school1  | q6           | 0.43
school1  | q7           | 0.31
school1  | q8           | 0.15
school1  | q9           | 0.21
school1  | q10          | 0.92

Now I want to bucket the field "difficultyValue" by its value and convert this dataframe into a new dataframe with the following schema:

root
|-- school: string (nullable = true)
|-- difficulty1: double (nullable = true)
|-- difficulty2: double (nullable = true)
|-- difficulty3: double (nullable = true)
|-- difficulty4: double (nullable = true)
|-- difficulty5: double (nullable = true)

and the new data table looks like this:

school   | difficulty1 | difficulty2 | difficulty3 | difficulty4 | difficulty5
school1  | 2           | 3           | 3           | 1           | 1

The value of field "difficulty1" is the number of rows with "difficultyValue" < 0.2;

The value of field "difficulty2" is the number of rows with 0.2 <= "difficultyValue" < 0.4;

The value of field "difficulty3" is the number of rows with 0.4 <= "difficultyValue" < 0.6;

The value of field "difficulty4" is the number of rows with 0.6 <= "difficultyValue" < 0.8;

The value of field "difficulty5" is the number of rows with 0.8 <= "difficultyValue" < 1.0;
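Taken together, these five rules are just uniform 0.2-wide buckets, so the mapping from value to bucket index can be sketched as a single Scala function (the helper name `difficultyBucket` is made up for illustration):

```scala
// Map a difficultyValue in [0.0, 1.0) to its bucket index 1..5.
// floor(v * 5) + 1 makes each lower bound inclusive, so e.g. 0.2
// lands in difficulty2 and 0.19 in difficulty1, matching the rules above.
def difficultyBucket(v: Double): Int = math.floor(v * 5).toInt + 1
```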

I don't know how to do this transformation. What am I supposed to do?

2 Answers

// First create a test data frame with the schema of your given source.
val df = {
    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    import scala.collection.JavaConverters._

    val simpleSchema = StructType(
        StructField("school", StringType, false) ::
        StructField("questionName", StringType, false) ::
        StructField("difficultyValue", DoubleType) :: Nil)

    val data = List(
        Row("school1", "q1", 0.32),
        Row("school1", "q2", 0.45),
        Row("school1", "q3", 0.22),
        Row("school1", "q4", 0.12),
        Row("school2", "q1", 0.32),
        Row("school2", "q2", 0.42),
        Row("school2", "q3", 0.52),
        Row("school2", "q4", 0.62)
    )    

    spark.createDataFrame(data.asJava, simpleSchema)
}
// Add a new column that is the 1-5 category.
val df2 = df.withColumn("difficultyCat", floor(col("difficultyValue").multiply(5.0)) + 1)
// groupBy and pivot to get the final view that you want.
// Here we know the 1-5 values beforehand; if you don't, you can omit the sequence at a performance cost.
val df3 = df2.groupBy("school").pivot("difficultyCat", Seq(1, 2, 3, 4, 5)).count()

df3.show()

2 Comments

Hi Clay, your answer is great. Because I have only five columns, I can specify the list of distinct values to pivot on, like this: `val df3 = df2.groupBy("schoolID").pivot("difficultyCat", Seq(1, 2, 3, 4, 5)).count()`. Thank you so much.
Yes, you are correct. If you know the possible values beforehand, as we do in this scenario, you should pass them to the pivot function for performance reasons. I updated the code in the answer.

The following function:

def valueToIndex(v: Double): Int = scala.math.floor(v * 5).toInt + 1

will determine the index you want from the difficulty value, since you just want 5 uniform bins. (Note that `ceil(v * 5)` would put a boundary value such as 0.2 into the lower bin, which does not match your definitions, so `floor` plus an offset of 1 is used here.) You can use this function to create a new derived column using withColumn and udf, and then you can use pivot to generate the counts of rows per index.
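For intuition, the derive-then-pivot step can be mimicked on a plain Scala collection (a sketch using sample rows in the shape of the question's data, not the actual Spark code):

```scala
// Sample (school, difficultyValue) pairs, one per bucket.
val rows = List(("school1", 0.32), ("school1", 0.13), ("school1", 0.58),
                ("school1", 0.67), ("school1", 0.92))

// Derive the 1-5 index per row, then count rows per (school, index):
// this is what withColumn + groupBy + pivot + count computes in Spark.
def valueToIndex(v: Double): Int = math.floor(v * 5).toInt + 1
val counts: Map[(String, Int), Int] =
  rows.groupBy { case (school, v) => (school, valueToIndex(v)) }
      .map { case (key, group) => (key, group.size) }
```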

