
I have a basic 'for' loop that counts the number of active customers for each year. I can print the output, but I want to collect it into a single table/dataframe with 2 columns (year and # customers), where each iteration of the loop adds 1 row to the table.

from pyspark.sql.functions import col, year

for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
  • Since what you've written uses <= yr, each year's count will include the previous years' counts, right? E.g. the count for 2019 will include 2018's. So why not group by the year, count, collect a smaller pandas dataframe, and then cumsum on the count? Commented Aug 12, 2022 at 17:02

2 Answers


I was able to solve this by creating an empty dataframe with the desired schema outside the loop and using union, but I'm still curious whether there's a shorter solution?

from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("customer_count", IntegerType(), True),
])

# Start from an empty dataframe and union one row per year.
df2 = spark.createDataFrame([], schema=schema)
for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)
  
display(df2)

I don't have your data, so I can't test if this works, but how about something like this:

from pyspark.sql.functions import col, year

year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()

The view grp[['year', 'cumsum']] should match the output of your for-loop.
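To see why the cumsum matches the loop, here is the same idea on toy pandas data standing in for the result of the Spark groupby (the per-year counts are made up for illustration):

```python
import pandas as pd

# Hypothetical per-year counts of first sales, as the Spark groupby
# might return them (unsorted).
grp = pd.DataFrame({"year": [2019, 2018, 2020], "count": [3, 2, 5]})

grp = grp.sort_values("year").reset_index(drop=True)
grp["cumsum"] = grp["count"].cumsum()

# cumsum for a given year equals the number of customers whose
# first_sale year is <= that year, i.e. exactly what the original
# filter(...).count() loop computed.
print(grp[["year", "cumsum"]])
```

Here 2018 has 2 customers, 2019 adds 3 more (cumsum 5), and 2020 adds 5 more (cumsum 10), which is what the `<= yr` filter would count year by year.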
