
I have a basic 'for' loop that counts the number of active customers for each year. I can print the output, but I want to collect it into a single table/dataframe with 2 columns (year and # customers), where each iteration of the loop adds 1 row to the table.

from pyspark.sql.functions import col, year

for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
  • Since what you've written uses <= yr, each year's count will include the previous years' counts, right? E.g. the count for 2019 will include 2018's. So why not group by the year, count, collect a smaller pandas dataframe, and then cumsum on the count? Commented Aug 12, 2022 at 17:02

2 Answers


I was able to solve this by creating an empty dataframe with the desired schema outside the loop and using union, but I'm still curious whether there's a shorter solution?

from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("customer_count", IntegerType(), True),
])

# Start from an empty dataframe and union one row per year.
df2 = spark.createDataFrame([], schema=schema)
for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)
  
display(df2)

I don't have your data, so I can't test if this works, but how about something like this:

from pyspark.sql.functions import col, year

year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()

The view grp[['year', 'cumsum']] should match the output of your for-loop.
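To see why the cumsum matches the loop, here is the same idea on toy pandas data standing in for the result of the Spark groupby (the per-year counts are made up for illustration):

```python
import pandas as pd

# Hypothetical per-year counts of first sales, as the Spark groupby
# might return them (unsorted).
grp = pd.DataFrame({"year": [2019, 2018, 2020], "count": [3, 2, 5]})

grp = grp.sort_values("year").reset_index(drop=True)
grp["cumsum"] = grp["count"].cumsum()

# cumsum for a given year equals the number of customers whose
# first_sale year is <= that year, i.e. exactly what the original
# filter(...).count() loop computed.
print(grp[["year", "cumsum"]])
```

Here 2018 has 2 customers, 2019 adds 3 more (cumsum 5), and 2020 adds 5 more (cumsum 10), which is what the `<= yr` filter would count year by year.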
