I want to recreate sklearn's train_test_split function for PySpark, and I am using a pandas UDF to do it.
This is what I have so far:
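from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

# schema and cols are defined elsewhere
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(dataset):
    feature_columns = cols
    label = 'y'
    X = dataset[feature_columns]
    Y = dataset[label]
    # splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    print(X_train)
    # this returns four objects, which is what I want to get back
    return X_train, X_test, y_train, y_test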
I want the DataFrames X_train, X_test, y_train, and y_test separately.
I know that a grouped map UDF is called like this:
df.groupby("key").apply(load_dataset).show()
But I don't know what to use in place of "key".
Also, this returns a single DataFrame, and I want four.
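
To make the "single DataFrame versus four" point concrete, here is a minimal pandas-only sketch of one possible shape: the function returns one DataFrame with an extra column marking each row as train or test, and the four pieces are then derived by filtering. The helper name tag_split, the column name split, and the demo columns x1, x2 are hypothetical and not from my real code. It uses a plain shuffle rather than sklearn's train_test_split, so it is only an approximation of that behaviour.

```python
import numpy as np
import pandas as pd


def tag_split(pdf: pd.DataFrame, test_size: float = 0.2, seed: int = 0) -> pd.DataFrame:
    """Return a copy of pdf with a 'split' column set to 'train' or 'test'.

    Hypothetical helper: a shuffle-based stand-in for train_test_split
    that keeps everything in ONE DataFrame.
    """
    rng = np.random.default_rng(seed)
    n_rows = len(pdf)
    n_test = int(round(n_rows * test_size))
    # Shuffle row positions and mark the first n_test of them as test rows.
    order = rng.permutation(n_rows)
    split = np.full(n_rows, "train", dtype=object)
    split[order[:n_test]] = "test"
    out = pdf.copy()
    out["split"] = split
    return out


# Small demonstration with made-up data (not a real dataset).
demo = pd.DataFrame({"x1": range(10), "x2": range(10, 20), "y": [0, 1] * 5})
tagged = tag_split(demo, test_size=0.2, seed=42)

# Derive the four pieces from the single tagged DataFrame.
train = tagged[tagged["split"] == "train"]
test = tagged[tagged["split"] == "test"]
X_train, y_train = train[["x1", "x2"]], train["y"]
X_test, y_test = test[["x1", "x2"]], test["y"]
```

A function of this shape would presumably fit the grouped map contract only if the declared schema also includes the extra split column, and the grouping key question still remains. Separately, Spark DataFrames have a built-in randomSplit method, for example df.randomSplit([0.8, 0.2], seed=42), which may avoid the UDF entirely.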