
I have a DataFrame with columns time, a, b, c, d, val. I would like to create a DataFrame with an additional column containing the row number of each row within its group, where (a, b, c, d) is the group key.

I tried with Spark SQL by defining a window function; in SQL it looks like this:

select time, a, b, c, d, val,
       row_number() over (partition by a, b, c, d order by time) as rn
from table

I would like to do this on the DataFrame itself, without using Spark SQL.

Thanks

  • What do you mean by "without using Spark SQL"? Commented May 23, 2016 at 15:53

1 Answer


I don't know the Python API too well, but I'll give it a try. You can try something like:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "row_number",
    F.row_number().over(Window.partitionBy("a", "b", "c", "d").orderBy("time"))
).show()
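For intuition, the per-partition numbering that row_number() produces can be sketched in plain Python, without Spark. The tuples and column positions below are purely illustrative: each row is (time, a, b, c, d, val), rows are sorted by group key and time, then numbered within each group:

```python
from itertools import groupby

# Hypothetical rows: (time, a, b, c, d, val); values are made up.
rows = [
    (3, "x", 1, 1, 1, 10.0),
    (1, "x", 1, 1, 1, 20.0),
    (2, "y", 1, 1, 1, 30.0),
]

# Emulate row_number() over (partition by a,b,c,d order by time):
# sort by the group key (positions 1..4), then by time (position 0),
# and number rows consecutively within each group starting at 1.
numbered = []
keyed = sorted(rows, key=lambda r: (r[1:5], r[0]))
for _, group in groupby(keyed, key=lambda r: r[1:5]):
    for rn, row in enumerate(group, start=1):
        numbered.append(row + (rn,))

for row in numbered:
    print(row)
# (1, 'x', 1, 1, 1, 20.0, 1)
# (3, 'x', 1, 1, 1, 10.0, 2)
# (2, 'y', 1, 1, 1, 30.0, 1)
```

Note that Spark computes this per partition in parallel, but the numbering semantics within each (a, b, c, d) group are the same.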

7 Comments

Yes, that's the same as what I did; you missed the partition part: df = df.withColumn("id", F.rowNumber().over(Window.partitionBy("a","b","c","d").orderBy(col("time")))). But I would like to do it without that. Thanks
Note that Spark <= 1.6 uses a different function name, rowNumber().
Thanks @laguittemh
@CarlosVilchez is it necessary to use the orderBy part? Can we just add the row_number preserving the natural ordering, without sorting?
@Matthew you may need to create a new question for that. There may be some complexities I don't see off the top of my head, but you need orderBy, and probably a new column with the row_number to use it.
