
I would like to create a column with sequential numbers in a PySpark dataframe, starting from a specified number. For instance, I want to add a column A to my dataframe df that starts at 5 and increments by one for each row, so 5, 6, 7, ..., up to the length of df.

Is there a simple solution using PySpark methods?

4 Comments
  • Easiest way is probably df = df.rdd.zipWithIndex().toDF(cols + ["index"]).withColumn("index", f.col("index") + 5) where cols = df.columns and f refers to pyspark.sql.functions. But you should ask yourself why you're doing this, because almost surely there's a better way. DataFrames are inherently unordered, so this operation is not efficient. Commented Jul 6, 2018 at 2:07
  • Thank you! At the end I want to add the final results to a Hive table. I have to take max(id) from this table and add new records with id starting from max(id) + 1. Commented Jul 6, 2018 at 9:44
  • I do not think it is possible to get a serial id column in Hive like that. Hive/Spark is intended for parallel processing. Even though the code in my comment works for you and you may be able to come up with a way to achieve your desired result, this is not really a good use case for spark or hive. Commented Jul 6, 2018 at 13:34
  • I handled it by adding a new column to my df like this: max(id) + spark_func.row_number().over(Window.orderBy(unique_field_in_my_df)) Commented Jul 11, 2018 at 9:47
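
A rough sketch of that last approach (assuming a SparkSession named spark, a Hive table called my_table with an integer id column, and an orderable column unique_field_in_my_df in df; the table and column names are illustrative, not from the question):

from pyspark.sql import functions as spark_func
from pyspark.sql.window import Window

# highest id already stored in the target table (fall back to 0 if it is empty)
max_id = spark.table("my_table").agg(spark_func.max("id")).first()[0] or 0

# row_number() is 1-based, so the new ids start at max_id + 1
df = df.withColumn(
    "id",
    spark_func.lit(max_id) + spark_func.row_number().over(
        Window.orderBy("unique_field_in_my_df")
    ),
)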

4 Answers


You can do this using spark.range:

df_len = 100
freq = 1

# ids 5, 6, ..., df_len - 1, stepping by freq
ref = spark.range(5, df_len, freq).toDF("id")
ref.show(10)

+---+
| id|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
+---+

only showing top 10 rows
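
If the upper end of the sequence should track the number of rows in an existing dataframe, as in the question, one way is to derive it from a count (a sketch; df here stands for the asker's dataframe, and the result is still a separate dataframe rather than a new column on df):

# one value per row of df: 5, 6, ..., 5 + df.count() - 1
ref = spark.range(5, 5 + df.count()).toDF("id")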


1 Comment

The question is to add a "new" column to an existing dataframe

Three simple steps:

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

df = df.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
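
Note that row_number() is 1-based, so the column above starts at 1. To begin at the 5 from the question, one option is to add a constant offset on top of the answer's code (a small sketch, not part of the original answer):

# shift the 1-based row number by 4 so the sequence is 5, 6, 7, ...
df = df.withColumn("A", row_number().over(Window.orderBy(monotonically_increasing_id())) + 4)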



Although the question was asked a long time ago, I thought I could share my solution, which I found very convenient. Basically, to add a column of 1, 2, 3, ... you can first add a column with a constant value of 1 using "lit":

from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = df.withColumn("Id", func.lit(1))

Then apply a cumulative sum (unique_field_in_my_df is in my case a date column; you could probably also use an index column):

windowCumSum = Window.partitionBy().orderBy('unique_field_in_my_df').rowsBetween(Window.unboundedPreceding,0)
df = df.withColumn("Id",func.sum("Id").over(windowCumSum))
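
A minimal end-to-end sketch of that approach, with an extra offset so the sequence starts at the 5 from the question (the sample data and column name are made up for illustration):

from pyspark.sql import SparkSession, functions as func
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy dataframe with an orderable column
df = spark.createDataFrame(
    [("2018-01-01",), ("2018-01-02",), ("2018-01-03",)],
    ["unique_field_in_my_df"],
)

windowCumSum = Window.partitionBy().orderBy("unique_field_in_my_df").rowsBetween(Window.unboundedPreceding, 0)

# cumulative sum of a constant 1 gives 1, 2, 3, ...; add 4 so it starts at 5
df = df.withColumn("Id", func.sum(func.lit(1)).over(windowCumSum) + 4)
df.show()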



This worked for me. It creates sequential values in the column.

from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window
seed = 23
df = df.withColumn('label', seed + dense_rank().over(Window.orderBy('column')))
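
One caveat, not from the original answer: dense_rank() assigns the same number to rows that tie on 'column', so duplicates in that column break the strictly sequential pattern. A sketch using row_number() instead (same placeholder column name as the answer):

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

seed = 23
# row_number() never repeats, so the values are seed + 1, seed + 2, ...
df = df.withColumn('label', seed + row_number().over(Window.orderBy('column')))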

1 Comment

Does it require the data to be in the same partition?
