
I want to recreate sklearn's train_test_split function for PySpark. I am using a pandas UDF to build this function.

This is what I have done.

from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(dataset):
    feature_columns = cols
    label = 'y'
    X = dataset[feature_columns]
    Y = dataset[label]

    # split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    print(X_train)

    return X_train, X_test, y_train, y_test

I want these DataFrames X_train, X_test, y_train, and y_test returned separately.

I know that the UDF is called like this:

df.groupby("key").apply(load_dataset).show()

But I don't know what to use in place of key. Also, this returns a single DataFrame, and I want four.

  • I am 100% certain pyspark has this function already. Commented Jan 28, 2021 at 17:35
  • If you just want to split your dataframes you can use randomSplit Commented Jan 28, 2021 at 17:54
  • But I don't want to use randomSplit; I actually want to use sklearn's train_test_split function in PySpark. Commented Jan 28, 2021 at 18:03
  • Can you share that link? @John Stud Commented Jan 28, 2021 at 18:24
  • As far as I know, this is just not possible with pandas_udf. You can't return 4 Spark DataFrames. Please read the docs Pandas Function APIs. Commented Jan 28, 2021 at 19:02
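As the last comment notes, a grouped-map pandas UDF must return exactly one pandas DataFrame per group. One common workaround (a sketch, not taken from the answers below) is to tag each row with a split flag inside the function and filter the result afterwards; the column names here are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_flag(pdf: pd.DataFrame) -> pd.DataFrame:
    """Tag each row of one group as 'train' or 'test' (80/20 split)."""
    train_idx, test_idx = train_test_split(pdf.index, test_size=0.2, random_state=42)
    out = pdf.copy()
    out["split"] = "train"
    out.loc[test_idx, "split"] = "test"
    return out
```

In Spark this function would be applied per group and the single result filtered back into four pieces, roughly:

```python
# result = df.groupby("key").applyInPandas(split_with_flag, schema="..., split string")
# train_df = result.where("split = 'train'")
# test_df  = result.where("split = 'test'")
```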

2 Answers


Actually, I needed to do subsampling; that is why I wanted the four return values from train_test_split. Instead, I concatenated X_test and y_test and returned a single DataFrame.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(dataset):
    feature_columns = cols
    label = 'y'
    X = dataset[feature_columns]
    Y = dataset[label]

    # split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

    # GROUPED_MAP must return a single pandas DataFrame,
    # so concatenate the test features and labels
    df_sample = pd.concat([X_test, y_test], axis=1)

    return df_sample

This code works for me.
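Since the UDF returns features and label concatenated into one DataFrame, the two pieces can be recovered afterwards by plain column selection. A minimal pandas illustration (the toy column names are assumptions; on the Spark result the same idea works with .select):

```python
import pandas as pd

# toy stand-in for the DataFrame returned by load_dataset:
# feature columns plus the label column 'y', as in the answer
df_sample = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "y": [0, 1]})

feature_columns = ["f1", "f2"]   # plays the role of the question's `cols`
X_test = df_sample[feature_columns]
y_test = df_sample["y"]
```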



What is wrong with:

df = inputDF.cache()
a,b = df.randomSplit([0.5, 0.5])

For time series where order matters, use:

from pyspark.sql import Window
from pyspark.sql.functions import percent_rank

df = df.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("departure_time")))

train_df = df.where("rank <= .8").drop("rank", "departure_time")
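For illustration, the same ordered 80/20 split can be sketched in plain pandas (a toy frame with an assumed "departure_time" column; pandas' percentage rank differs slightly from Spark's percent_rank at the boundaries):

```python
import pandas as pd

# toy time-series frame ordered by departure_time
df = pd.DataFrame({
    "departure_time": range(10),
    "value": list("abcdefghij"),
})

# rank each row by departure_time, scaled into (0, 1]
df["rank"] = df["departure_time"].rank(pct=True)

# earliest 80% for training, latest 20% for testing
train_df = df[df["rank"] <= 0.8].drop(columns=["rank", "departure_time"])
test_df = df[df["rank"] > 0.8].drop(columns=["rank", "departure_time"])
```

Because the split is by rank over an ordering column rather than by random sampling, the test rows are always the most recent ones, which is what you want for time series.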

3 Comments

I want to do the subsampling on the X_test and y_test DataFrames by using sklearn's train_test_split.
Can you offer more insight? Why not generate X_test and y_test as I specified above, and then run the same process again to "subsample"?
I think randomSplit is not great for subsampling; that is why I want to create a pandas UDF for sklearn's train_test_split so that I can use it.
