
I have a dataframe whose schema is below:

root
|-- school: string (nullable = true)
|-- questionName: string (nullable = true)
|-- difficultyValue: double (nullable = true)

The data is like this:

school   | questionName | difficultyValue
school1  | q1           | 0.32
school1  | q2           | 0.13
school1  | q3           | 0.58
school1  | q4           | 0.67
school1  | q5           | 0.59
school1  | q6           | 0.43
school1  | q7           | 0.31
school1  | q8           | 0.15
school1  | q9           | 0.21
school1  | q10          | 0.92

Now I want to bucket the field "difficultyValue" by its value and convert this dataframe into a new dataframe with the following schema:

root
|-- school: string (nullable = true)
|-- difficulty1: double (nullable = true)
|-- difficulty2: double (nullable = true)
|-- difficulty3: double (nullable = true)
|-- difficulty4: double (nullable = true)
|-- difficulty5: double (nullable = true)

and the new data table looks like this:

school   | difficulty1 | difficulty2 | difficulty3 | difficulty4 | difficulty5
school1  | 2           | 3           | 3           | 1           | 1

The value of field "difficulty1" is the number of rows with "difficultyValue" < 0.2;

The value of field "difficulty2" is the number of rows with 0.2 <= "difficultyValue" < 0.4;

The value of field "difficulty3" is the number of rows with 0.4 <= "difficultyValue" < 0.6;

The value of field "difficulty4" is the number of rows with 0.6 <= "difficultyValue" < 0.8;

The value of field "difficulty5" is the number of rows with 0.8 <= "difficultyValue" < 1.0;
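Taken together, these five rules are just uniform 0.2-wide buckets, so the mapping from value to bucket index can be sketched as a single Scala function (the helper name `difficultyBucket` is made up for illustration):

```scala
// Map a difficultyValue in [0.0, 1.0) to its bucket index 1..5.
// floor(v * 5) + 1 makes each lower bound inclusive, so e.g. 0.2
// lands in difficulty2 and 0.19 in difficulty1, matching the rules above.
def difficultyBucket(v: Double): Int = math.floor(v * 5).toInt + 1
```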

I don't know how to do this transformation. What am I supposed to do?

2 Answers

// First create a test data frame with the schema of your given source.
val df = {
    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    import scala.collection.JavaConverters._

    val simpleSchema = StructType(
        StructField("school", StringType, false) ::
        StructField("questionName", StringType, false) ::
        StructField("difficultyValue", DoubleType) :: Nil)

    val data = List(
        Row("school1", "q1", 0.32),
        Row("school1", "q2", 0.45),
        Row("school1", "q3", 0.22),
        Row("school1", "q4", 0.12),
        Row("school2", "q1", 0.32),
        Row("school2", "q2", 0.42),
        Row("school2", "q3", 0.52),
        Row("school2", "q4", 0.62)
    )    

    spark.createDataFrame(data.asJava, simpleSchema)
}
// Add a new column that is the 1-5 category.
val df2 = df.withColumn("difficultyCat", floor(col("difficultyValue").multiply(5.0)) + 1)
// groupBy and pivot to get the final view that you want.
// Here we know the 1-5 values beforehand; if you don't, you can omit the sequence at a performance cost.
val df3 = df2.groupBy("school").pivot("difficultyCat", Seq(1, 2, 3, 4, 5)).count()

df3.show()

2 Comments

Hi Clay, your answer is great. Because I have only five columns, I can specify the list of distinct values to pivot on, like this: `val df3 = df2.groupBy("schoolID").pivot("difficultyCat", Seq(1, 2, 3, 4, 5)).count()`. Thank you so much.
Yes, you are correct. If you know the possible values beforehand, as we do in this scenario, you should pass them to the pivot function for performance reasons. I updated the code in the answer.

The following function:

def valueToIndex(v: Double): Int = scala.math.floor(v * 5).toInt + 1

will determine the index you want from the difficulty value, since you just want 5 uniform bins. (Note that `ceil(v * 5)` would put a boundary value such as 0.2 into the lower bin, which does not match your definitions, so `floor` plus an offset of 1 is used here.) You can use this function to create a new derived column using withColumn and udf, and then you can use pivot to generate the counts of rows per index.
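For intuition, the derive-then-pivot step can be mimicked on a plain Scala collection (a sketch using sample rows in the shape of the question's data, not the actual Spark code):

```scala
// Sample (school, difficultyValue) pairs, one per bucket.
val rows = List(("school1", 0.32), ("school1", 0.13), ("school1", 0.58),
                ("school1", 0.67), ("school1", 0.92))

// Derive the 1-5 index per row, then count rows per (school, index):
// this is what withColumn + groupBy + pivot + count computes in Spark.
def valueToIndex(v: Double): Int = math.floor(v * 5).toInt + 1
val counts: Map[(String, Int), Int] =
  rows.groupBy { case (school, v) => (school, valueToIndex(v)) }
      .map { case (key, group) => (key, group.size) }
```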

