How to create dynamic custom function within a loop in pyspark?

Question

I am facing a code challenge in spark within Azure databricks. I have a dataset as

+----+-------------------+----------+--------+-----+---------------+
|OPID|               Date|BaseAmount|Interval|Cycle|RateIncrease(%)|
+----+-------------------+----------+--------+-----+---------------+
|  O1|2014-07-27 00:00:00|      4375|      12|    2|             2%|
|  O2|2020-12-23 00:00:00|      4975|       7|    3|             5%|
+----+-------------------+----------+--------+-----+---------------+

I need to use a loop function which replicates the rows based on Interval, Cycle and RateIncrease(%). Interval and Cycle fields give me the number of rows to be replicated.

Number of rows = Interval * Cycle.

For OPID O1, the number of rows must be 24 (12 * 2).

After 12 months (12 rows) or after 1 cycle completion, the values of BaseAmount are going to increase by 2% for OPID O1 and this happens after every cycle. which should result in a table as below:

+----+-------------------+----------+--------+-----+---------------+
|OPID|               Date|BaseAmount|Interval|Cycle|RateIncrease(%)|
+----+-------------------+----------+--------+-----+---------------+
|  O1|2014-07-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-08-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-09-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-10-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-11-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-12-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-01-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-02-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-03-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-04-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-05-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-06-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-07-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-08-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-09-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-10-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-11-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-12-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-01-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-02-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-03-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-04-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-05-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-06-27 00:00:00|      4463|      12|    2|             2%|
|  O2|2020-12-23 00:00:00|      4975|       7|    3|             5%|
                                .
                                .
                                .
+----+-------------------+----------+--------+-----+---------------+

I got the initial bits solved thanks to user @mck. How to insert a custom function within For loop in pyspark?

Thank you.

mck · Accepted Answer · 2021-02-12 14:38:37Z

You can explode a sequence of cycles first, in order to get the changed BaseAmount, and then explode a sequence of timestamps as in my previous answer:

import pyspark.sql.functions as F

result = df.withColumn(
    'cycle2',
    F.expr("explode(sequence(0, int(Cycle)-1))")
).withColumn(
    'BaseAmount',
    F.expr("int(round(BaseAmount * (1 + 0.01*cycle2*float(trim('%' from `RateIncrease(%)`)))))")
).withColumn(
    'Date',
    F.expr("""
        explode(
            sequence(
                timestamp(add_months(timestamp(Date), cycle2*(`Interval`))), 
                timestamp(add_months(timestamp(Date), (cycle2+1)*(`Interval`) - 1)),
                interval 1 month
            )
        )
    """)
).drop('cycle2')

result.show(999)
+----+-------------------+----------+--------+-----+---------------+
|OPID|               Date|BaseAmount|Interval|Cycle|RateIncrease(%)|
+----+-------------------+----------+--------+-----+---------------+
|  O1|2014-07-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-08-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-09-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-10-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-11-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2014-12-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-01-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-02-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-03-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-04-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-05-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-06-27 00:00:00|      4375|      12|    2|             2%|
|  O1|2015-07-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-08-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-09-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-10-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-11-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2015-12-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-01-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-02-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-03-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-04-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-05-27 00:00:00|      4463|      12|    2|             2%|
|  O1|2016-06-27 00:00:00|      4463|      12|    2|             2%|
|  O2|2020-12-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-01-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-02-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-03-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-04-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-05-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-06-23 00:00:00|      4975|       7|    3|             5%|
|  O2|2021-07-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2021-08-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2021-09-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2021-10-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2021-11-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2021-12-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2022-01-23 00:00:00|      5224|       7|    3|             5%|
|  O2|2022-02-23 00:00:00|      5473|       7|    3|             5%|
|  O2|2022-03-23 00:00:00|      5473|       7|    3|             5%|
|  O2|2022-04-23 00:00:00|      5473|       7|    3|             5%|
|  O2|2022-05-23 00:00:00|      5473|       7|    3|             5%|
|  O2|2022-06-23 00:00:00|      5473|       7|    3|             5%|
|  O2|2022-07-23 00:00:00|      5473|       7|    3|             5%|
|  O2|2022-08-23 00:00:00|      5473|       7|    3|             5%|
+----+-------------------+----------+--------+-----+---------------+

Thanks so much for providing a clear solution. I would only change one bit in the query to suit my needs. F.expr("int(round(BaseAmount * power(1 + float(RateIncrease),Cycle2)))") as I will be doing a compounding growth

Collectives™ on Stack Overflow

How to create dynamic custom function within a loop in pyspark?

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related