
I have a situation as below. I have a master DataFrame DF1, and I am processing changes inside a for loop. My pseudocode is as below.

for Year in [2019, 2020]:
    query_west = f'query_{Year}'
    df_west = spark.sql(query_west)
    df_final = DF1.join(df_west, on='ID', how='left')

In this case df_final gets recomputed from the join on every iteration, right? I want those changes to be reflected on my main DataFrame DF1 in every iteration of the for loop.

Please let me know whether my logic is right. Thanks.

  • Unless you have DF1 = df_final after the 3rd line, you will be creating df_final each iteration and you will only have the latest result at the end of the loop. – venky__, Mar 2, 2021 at 9:31
  • OK, thanks. Should I add DF1 = df_final as the 4th line in my code inside the for loop? – Mar 2, 2021 at 10:50

1 Answer


As the comment by @venky__ suggested, you need to add another line, DF1 = df_final, at the end of the for loop body, in order to make sure DF1 is updated in each iteration.
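To see why the rebinding matters, here is a plain-Python analogy: a hypothetical left_join over dicts stands in for the Spark join (none of this is Spark API). Without the final reassignment, every iteration would start from the original DF1 and only the last year's join would survive.

```python
# Hypothetical stand-in: a dict of ID -> row plays the role of a DataFrame,
# and a "left join" keeps every left-side key, attaching the matching
# right-side value when present (None otherwise).
def left_join(left, right):
    return {k: (v, right.get(k)) for k, v in left.items()}

df1 = {1: 'a', 2: 'b'}                       # master "DataFrame"
yearly = {2019: {1: 'x'}, 2020: {2: 'y'}}    # per-year query results

for year in [2019, 2020]:
    df_final = left_join(df1, yearly[year])
    df1 = df_final  # rebind: without this, each iteration joins against the original df1
```

After the loop, df1 carries the result of both joins accumulated in order, which is the behaviour the question is after.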

Another way is to use reduce to combine the joins all at once. e.g.

from functools import reduce

dfs = [DF1]
for Year in [2019, 2020]:
    query_west = f'query_{Year}'
    df_west = spark.sql(query_west)
    dfs.append(df_west)

# left-join every yearly result onto DF1 in a single fold
df_final = reduce(lambda x, y: x.join(y, 'ID', 'left'), dfs)

which is equivalent to

df_final = DF1.join(spark.sql('query_2019'), 'ID', 'left').join(spark.sql('query_2020'), 'ID', 'left')
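You can check the fold order of reduce without Spark at all. In this sketch a stand-in join function just records its operands as a string, so the left-to-right grouping that reduce produces is visible (the names DF1, q2019, q2020 are placeholders, not real DataFrames):

```python
from functools import reduce

# Stand-in "join" that records its operands, making the fold order visible.
join = lambda x, y: f'({x} JOIN {y})'

dfs = ['DF1', 'q2019', 'q2020']
folded = reduce(join, dfs)
# folded == '((DF1 JOIN q2019) JOIN q2020)' -- same shape as the chained .join calls
```

This confirms that the reduce version groups exactly like the explicitly chained joins above: DF1 first, then each yearly frame folded in order.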

