
I am new to Spark dataframes. I am trying to use the pivot method (Spark 2.x) and am running into the following error:

Py4JError: An error occurred while calling o387.pivot. Trace: py4j.Py4JException: Method pivot([class java.lang.String, class java.lang.String]) does not exist

Even though I apply first inside the agg call here, I do not actually need any aggregation.

My dataframe looks like this:

+-----+-----+----------+-----+
| name|value|      date| time|
+-----+-----+----------+-----+
|name1|100.0|2017-12-01|00:00|
|name1|255.5|2017-12-01|00:15|
|name1|333.3|2017-12-01|00:30|
+-----+-----+----------+-----+
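
For reference, this dataframe can be reproduced with the snippet below (the column types are my assumption, inferred from the display above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the display above
df = spark.createDataFrame(
    [
        ("name1", 100.0, "2017-12-01", "00:00"),
        ("name1", 255.5, "2017-12-01", "00:15"),
        ("name1", 333.3, "2017-12-01", "00:30"),
    ],
    ["name", "value", "date", "time"],
)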

Expected:

+-----+----------+-----+-----+-----+
| name|      date|00:00|00:15|00:30|
+-----+----------+-----+-----+-----+
|name1|2017-12-01|100.0|255.5|333.3|
+-----+----------+-----+-----+-----+

Here is what I am trying:

df = df.groupBy(["name","date"]).pivot(pivot_col="time",values="value").agg(first("value")).show

What is my mistake here?

1 Answer


The problem is the values="value" parameter in the pivot function. This should be used for a list of actual values to pivot on, not a column name. From the documentation:

values – List of values that will be translated to columns in the output DataFrame.

and an example:

df4.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").collect()
[Row(year=2012, dotNET=15000, Java=20000), Row(year=2013, dotNET=48000, Java=30000)]

For the example in the question, values should be set to ["00:00", "00:15", "00:30"]. However, the values argument is optional (though providing it makes the pivot more efficient, since Spark otherwise needs an extra pass over the data to compute the distinct values of the pivot column), so you can simply change the line to:

df = df.groupBy(["name","date"]).pivot("time").agg(first("value"))
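
Putting it together, here is a minimal runnable sketch (note that first must be imported from pyspark.sql.functions; the second variant shows the optional values optimization mentioned above):

from pyspark.sql.functions import first

# Spark computes the distinct values of "time" to derive the pivot columns
pivoted = df.groupBy(["name", "date"]).pivot("time").agg(first("value"))

# Equivalent, but avoids the extra pass by listing the pivot values up front
pivoted = (
    df.groupBy(["name", "date"])
      .pivot("time", ["00:00", "00:15", "00:30"])
      .agg(first("value"))
)

pivoted.show()

This produces the expected output shown in the question, with one column per distinct time value.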