
I would like to create a column with sequential numbers in a PySpark dataframe, starting from a specified number. For instance, I want to add a column A to my dataframe df that starts at 5 and increments by one for each row, so 5, 6, 7, ..., up to the length of df.

Is there a simple solution using PySpark methods?

4 Comments
  • Easiest way is probably df = df.rdd.zipWithIndex().toDF(cols + ["index"]).withColumn("index", f.col("index") + 5) where cols = df.columns and f refers to pyspark.sql.functions. But you should ask yourself why you're doing this, because almost surely there's a better way. DataFrames are inherently unordered, so this operation is not efficient. Commented Jul 6, 2018 at 2:07
  • Thank you! At the end I want to add the final results to a Hive table. I have to take max(id) from this table and add new records with id starting from max(id) + 1. Commented Jul 6, 2018 at 9:44
  • I do not think it is possible to get a serial id column in Hive like that. Hive/Spark is intended for parallel processing. Even though the code in my comment works for you and you may be able to come up with a way to achieve your desired result, this is not really a good use case for spark or hive. Commented Jul 6, 2018 at 13:34
  • I handled it by adding a new column to my df like this: max(id) + spark_func.row_number().over(Window.orderBy(unique_field_in_my_df)) Commented Jul 11, 2018 at 9:47
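
A rough sketch of that last approach (assuming a SparkSession named spark, a Hive table called my_table with an integer id column, and an orderable column unique_field_in_my_df in df; the table and column names are illustrative, not from the question):

from pyspark.sql import functions as spark_func
from pyspark.sql.window import Window

# highest id already stored in the target table (fall back to 0 if it is empty)
max_id = spark.table("my_table").agg(spark_func.max("id")).first()[0] or 0

# row_number() is 1-based, so the new ids start at max_id + 1
df = df.withColumn(
    "id",
    spark_func.lit(max_id) + spark_func.row_number().over(
        Window.orderBy("unique_field_in_my_df")
    ),
)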

4 Answers


You can do this using spark.range:

df_len = 100
freq = 1

# ids 5, 6, ..., df_len - 1, stepping by freq
ref = spark.range(5, df_len, freq).toDF("id")
ref.show(10)

+---+
| id|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
+---+

only showing top 10 rows
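
If the upper end of the sequence should track the number of rows in an existing dataframe, as in the question, one way is to derive it from a count (a sketch; df here stands for the asker's dataframe, and the result is still a separate dataframe rather than a new column on df):

# one value per row of df: 5, 6, ..., 5 + df.count() - 1
ref = spark.range(5, 5 + df.count()).toDF("id")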


1 Comment

The question is to add a "new" column to an existing dataframe

Three simple steps:

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

df = df.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
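
Note that row_number() is 1-based, so the column above starts at 1. To begin at the 5 from the question, one option is to add a constant offset on top of the answer's code (a small sketch, not part of the original answer):

# shift the 1-based row number by 4 so the sequence is 5, 6, 7, ...
df = df.withColumn("A", row_number().over(Window.orderBy(monotonically_increasing_id())) + 4)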



Although the question was asked a long time ago, I thought I could share my solution, which I found very convenient. Basically, to add a column of 1, 2, 3, ... you can first add a column with a constant value of 1 using "lit":

from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = df.withColumn("Id", func.lit(1))

Then apply a cumulative sum (unique_field_in_my_df is in my case a date column; you could probably also use an index column):

windowCumSum = Window.partitionBy().orderBy('unique_field_in_my_df').rowsBetween(Window.unboundedPreceding,0)
df = df.withColumn("Id",func.sum("Id").over(windowCumSum))
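
A minimal end-to-end sketch of that approach, with an extra offset so the sequence starts at the 5 from the question (the sample data and column name are made up for illustration):

from pyspark.sql import SparkSession, functions as func
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy dataframe with an orderable column
df = spark.createDataFrame(
    [("2018-01-01",), ("2018-01-02",), ("2018-01-03",)],
    ["unique_field_in_my_df"],
)

windowCumSum = Window.partitionBy().orderBy("unique_field_in_my_df").rowsBetween(Window.unboundedPreceding, 0)

# cumulative sum of a constant 1 gives 1, 2, 3, ...; add 4 so it starts at 5
df = df.withColumn("Id", func.sum(func.lit(1)).over(windowCumSum) + 4)
df.show()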



This worked for me. It creates sequential values in the column.

from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window
seed = 23
df = df.withColumn('label', seed + dense_rank().over(Window.orderBy('column')))
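
One caveat, not from the original answer: dense_rank() assigns the same number to rows that tie on 'column', so duplicates in that column break the strictly sequential pattern. A sketch using row_number() instead (same placeholder column name as the answer):

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

seed = 23
# row_number() never repeats, so the values are seed + 1, seed + 2, ...
df = df.withColumn('label', seed + row_number().over(Window.orderBy('column')))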

1 Comment

Does it require the data to be in the same partition?
