For example, I have a PySpark DataFrame like this:

df =
    x_data  y_data
    2.5     1.5
    3.5     8.5
    4.5     89.5
    5.5     20.5

Let's say I have some calculation to perform on each column of df, which I do inside a for loop. After that, my final output should look like this:

df_output =
    cal_1  cal_2  cal_3  cal_4  Datatype
    23     24     34     36     x_data
    12     13     18     90     x_data
    23     54     74     96     x_data
    41     13     38     50     x_data
    53     74     44     6      y_data
    72     23     28     50     y_data
    43     24     44     66     y_data
    41     23     58     30     y_data

How do I append these per-column results to the same PySpark output DataFrame inside the for loop?

1 Answer

You can use functools.reduce to union the list of DataFrames created in each iteration.

Something like this:

import functools
from pyspark.sql import DataFrame

output_dfs = []

for c in df.columns:
    # do some calculation on column `c` that yields a DataFrame
    # with the columns cal_1 ... cal_4 and Datatype
    df_output = _  # placeholder for the per-column calculation result

    output_dfs.append(df_output)

# union all per-column results (positionally) into a single DataFrame
df_output = functools.reduce(DataFrame.union, output_dfs)
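For a fully runnable, self-contained sketch, here is one way the loop body could look. The four aggregates (min, max, avg, sum) are only stand-ins for whatever your real per-column calculation is, and the Datatype column is added with lit(c) so each row records which source column it came from:

import functools

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2.5, 1.5), (3.5, 8.5), (4.5, 89.5), (5.5, 20.5)],
    ["x_data", "y_data"],
)

output_dfs = []

for c in df.columns:
    # stand-in calculation: four aggregates of the column, plus a
    # Datatype column recording which source column produced this row
    per_col = df.agg(
        F.min(c).alias("cal_1"),
        F.max(c).alias("cal_2"),
        F.avg(c).alias("cal_3"),
        F.sum(c).alias("cal_4"),
    ).withColumn("Datatype", F.lit(c))

    output_dfs.append(per_col)

# every per-column DataFrame has the same schema, so a positional union works
df_output = functools.reduce(DataFrame.union, output_dfs)
df_output.show()

Since all the per-column DataFrames share the same schema, the positional DataFrame.union is fine; if the column order could ever differ between iterations, DataFrame.unionByName is the safer choice.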