
I have a DataFrame with columns time, a, b, c, d, val. I would like to create a DataFrame with an additional column containing the row number of each row within its group, where (a, b, c, d) is the group key.

I tried with Spark SQL by defining a window function; in SQL it looks like this:

select time, a, b, c, d, val,
       row_number() over (partition by a, b, c, d order by time) as rn
from table

I would like to do this on the DataFrame itself, without using Spark SQL.

Thanks

  • What do you mean by "without using Spark SQL"? Commented May 23, 2016 at 15:53

1 Answer


I don't know the Python API too well, but I'll give it a try. You can try something like:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "row_number",
    F.row_number().over(Window.partitionBy("a", "b", "c", "d").orderBy("time"))
).show()
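For intuition, the per-partition numbering that row_number() produces can be sketched in plain Python, without Spark. The tuples and column positions below are purely illustrative: each row is (time, a, b, c, d, val), rows are sorted by group key and time, then numbered within each group:

```python
from itertools import groupby

# Hypothetical rows: (time, a, b, c, d, val); values are made up.
rows = [
    (3, "x", 1, 1, 1, 10.0),
    (1, "x", 1, 1, 1, 20.0),
    (2, "y", 1, 1, 1, 30.0),
]

# Emulate row_number() over (partition by a,b,c,d order by time):
# sort by the group key (positions 1..4), then by time (position 0),
# and number rows consecutively within each group starting at 1.
numbered = []
keyed = sorted(rows, key=lambda r: (r[1:5], r[0]))
for _, group in groupby(keyed, key=lambda r: r[1:5]):
    for rn, row in enumerate(group, start=1):
        numbered.append(row + (rn,))

for row in numbered:
    print(row)
# (1, 'x', 1, 1, 1, 20.0, 1)
# (3, 'x', 1, 1, 1, 10.0, 2)
# (2, 'y', 1, 1, 1, 30.0, 1)
```

Note that Spark computes this per partition in parallel, but the numbering semantics within each (a, b, c, d) group are the same.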

7 Comments

Yes, that's the same as what I did; you missed the partition part: df = df.withColumn("id", F.rowNumber().over(Window.partitionBy("a","b","c","d").orderBy(col("time")))). But I would like to do it without that. Thanks
Note that Spark <= 1.6 uses a different function name, rowNumber().
Thanks @laguittemh
@CarlosVilchez is it necessary to use the orderBy part? Can we just add the row_number preserving the natural ordering, without sorting?
@Matthew you may need to create a new question for that. There may be some complexities I don't see off the top of my head, but you need orderBy, and probably a new column with the row_number to use it.
