
I have a situation as below. I have a master DataFrame DF1, and I am processing changes inside a for loop. My pseudocode is as below.

for Year in [2019, 2020]:
    query_west = f'query_{Year}'
    df_west = spark.sql(query_west)
    df_final = DF1.join(df_west, on='ID', how='left')

In this case df_final gets recomputed from the join on every iteration, right? I want those changes to be reflected on my main DataFrame DF1 in every iteration of the for loop.

Please let me know whether my logic is right. Thanks.

  • Unless you have DF1 = df_final after the 3rd line, you will be creating df_final each iteration and you will only have the latest result at the end of the loop. – venky__, Mar 2, 2021 at 9:31
  • OK, thanks. Should I add DF1 = df_final as the 4th line in my code inside the for loop? – Mar 2, 2021 at 10:50

1 Answer


As the comment by @venky__ suggested, you need to add another line, DF1 = df_final, at the end of the for loop body, in order to make sure DF1 is updated in each iteration.
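To see why the rebinding matters, here is a plain-Python analogy: a hypothetical left_join over dicts stands in for the Spark join (none of this is Spark API). Without the final reassignment, every iteration would start from the original DF1 and only the last year's join would survive.

```python
# Hypothetical stand-in: a dict of ID -> row plays the role of a DataFrame,
# and a "left join" keeps every left-side key, attaching the matching
# right-side value when present (None otherwise).
def left_join(left, right):
    return {k: (v, right.get(k)) for k, v in left.items()}

df1 = {1: 'a', 2: 'b'}                       # master "DataFrame"
yearly = {2019: {1: 'x'}, 2020: {2: 'y'}}    # per-year query results

for year in [2019, 2020]:
    df_final = left_join(df1, yearly[year])
    df1 = df_final  # rebind: without this, each iteration joins against the original df1
```

After the loop, df1 carries the result of both joins accumulated in order, which is the behaviour the question is after.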

Another way is to use reduce to combine the joins all at once. e.g.

from functools import reduce

dfs = [DF1]
for Year in [2019, 2020]:
    query_west = f'query_{Year}'
    df_west = spark.sql(query_west)
    dfs.append(df_west)

# left-join every yearly result onto DF1 in a single fold
df_final = reduce(lambda x, y: x.join(y, 'ID', 'left'), dfs)

which is equivalent to

df_final = DF1.join(spark.sql('query_2019'), 'ID', 'left').join(spark.sql('query_2020'), 'ID', 'left')
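You can check the fold order of reduce without Spark at all. In this sketch a stand-in join function just records its operands as a string, so the left-to-right grouping that reduce produces is visible (the names DF1, q2019, q2020 are placeholders, not real DataFrames):

```python
from functools import reduce

# Stand-in "join" that records its operands, making the fold order visible.
join = lambda x, y: f'({x} JOIN {y})'

dfs = ['DF1', 'q2019', 'q2020']
folded = reduce(join, dfs)
# folded == '((DF1 JOIN q2019) JOIN q2020)' -- same shape as the chained .join calls
```

This confirms that the reduce version groups exactly like the explicitly chained joins above: DF1 first, then each yearly frame folded in order.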

