For example, I have a PySpark DataFrame like this:

df =
    x_data  y_data
    2.5     1.5
    3.5     8.5
    4.5     89.5
    5.5     20.5

Let's say I have some calculation to perform on each column of df, which I do inside a for loop. After that, my final output should look like this:

df_output =
    cal_1  cal_2  cal_3  cal_4  Datatype
    23     24     34     36     x_data
    12     13     18     90     x_data
    23     54     74     96     x_data
    41     13     38     50     x_data
    53     74     44     6      y_data
    72     23     28     50     y_data
    43     24     44     66     y_data
    41     23     58     30     y_data

How do I append these per-column results to the same PySpark output DataFrame inside the for loop?

1 Answer

You can use functools.reduce to union the list of DataFrames created in each iteration.

Something like this:

import functools
from pyspark.sql import DataFrame

output_dfs = []

for c in df.columns:
    # do some calculation on column `c` that yields a DataFrame
    # with the columns cal_1 ... cal_4 and Datatype
    df_output = _  # placeholder for the per-column calculation result

    output_dfs.append(df_output)

# union all per-column results (positionally) into a single DataFrame
df_output = functools.reduce(DataFrame.union, output_dfs)
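For a fully runnable, self-contained sketch, here is one way the loop body could look. The four aggregates (min, max, avg, sum) are only stand-ins for whatever your real per-column calculation is, and the Datatype column is added with lit(c) so each row records which source column it came from:

import functools

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2.5, 1.5), (3.5, 8.5), (4.5, 89.5), (5.5, 20.5)],
    ["x_data", "y_data"],
)

output_dfs = []

for c in df.columns:
    # stand-in calculation: four aggregates of the column, plus a
    # Datatype column recording which source column produced this row
    per_col = df.agg(
        F.min(c).alias("cal_1"),
        F.max(c).alias("cal_2"),
        F.avg(c).alias("cal_3"),
        F.sum(c).alias("cal_4"),
    ).withColumn("Datatype", F.lit(c))

    output_dfs.append(per_col)

# every per-column DataFrame has the same schema, so a positional union works
df_output = functools.reduce(DataFrame.union, output_dfs)
df_output.show()

Since all the per-column DataFrames share the same schema, the positional DataFrame.union is fine; if the column order could ever differ between iterations, DataFrame.unionByName is the safer choice.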