Create a kind of index in PySpark with Window and row_number

Question

I'm trying to create an index in a DataFrame with PySpark, using Window and the row_number function.

For example:

Original dataframe

Note: the data are random.

Coldata
A
B
C
D
E
F
G
H
I

Expected Dataframe:

Coldata	index
A	1
B	1
C	1
D	2
E	2
F	2
G	3
H	3
I	3

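My code at the moment is:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

w = Window.orderBy("Coldata")
df_expected = df.withColumn("index", row_number().over(w))

But this returns 1, 2, 3, 4, 5, ... instead of repeating each index for three rows.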

mck · Accepted Answer · 2021-02-04 14:17:24Z


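You can calculate (row_number + 2) / 3 and cast the result to an integer. The cast truncates the fraction, so row numbers 1 to 3 map to 1, 4 to 6 map to 2, and so on: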

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'index',
    ((F.row_number().over(Window.orderBy('Coldata')) + 2) / 3).cast('int')
)

df2.show()
+-------+-----+
|colData|index|
+-------+-----+
|      A|    1|
|      B|    1|
|      C|    1|
|      D|    2|
|      E|    2|
|      F|    2|
|      G|    3|
|      H|    3|
|      I|    3|
+-------+-----+
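
As a quick sanity check of the arithmetic, here is a minimal plain-Python sketch that does not need Spark. The helper name group_index and its group_size parameter are hypothetical, introduced only for illustration; integer division stands in for Spark's "divide, then cast to int", and enumerate stands in for row_number over the ordered window.

```python
def group_index(row_number: int, group_size: int = 3) -> int:
    """Map a 1-based row number to a 1-based group index (hypothetical helper)."""
    # Adding (group_size - 1) before dividing rounds up, so with group_size=3
    # this is the same (row_number + 2) / 3 truncation used in the answer.
    return (row_number + group_size - 1) // group_size


rows = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]

# enumerate(..., start=1) plays the role of row_number() over an ordered window.
indexed = [(value, group_index(n)) for n, value in enumerate(rows, start=1)]

for value, index in indexed:
    print(value, index)
```

Assuming the same reasoning carries over, a group size of n in PySpark would use (row_number + n - 1) / n before the cast. Note also that a window with orderBy but no partitionBy makes Spark move all rows into a single partition, which may be slow on large data.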
