
I want to recreate sklearn's train_test_split function for PySpark. I am using a pandas UDF to build this function.

This is what I have done.

from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(dataset):
    feature_columns = cols
    label = 'y'
    X = dataset[feature_columns]
    Y = dataset[label]

    # split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    print(X_train)

    return X_train, X_test, y_train, y_test

I want these DataFrames X_train, X_test, y_train, and y_test returned separately.

I know that the UDF is called like this:

df.groupby("key").apply(load_dataset).show()

But I don't know what to use in place of key. Also, this returns a single DataFrame, and I want four.

  • I am 100% certain pyspark has this function already. Commented Jan 28, 2021 at 17:35
  • If you just want to split your dataframes you can use randomSplit Commented Jan 28, 2021 at 17:54
  • But I don't want to use randomSplit; I actually want to use sklearn's train_test_split function in PySpark. Commented Jan 28, 2021 at 18:03
  • Can you share that link? @John Stud Commented Jan 28, 2021 at 18:24
  • As far as I know, this is just not possible with pandas_udf. You can't return 4 Spark DataFrames. Please read the docs Pandas Function APIs. Commented Jan 28, 2021 at 19:02
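As the last comment notes, a grouped-map pandas UDF must return exactly one pandas DataFrame per group. One common workaround (a sketch, not taken from the answers below) is to tag each row with a split flag inside the function and filter the result afterwards; the column names here are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_flag(pdf: pd.DataFrame) -> pd.DataFrame:
    """Tag each row of one group as 'train' or 'test' (80/20 split)."""
    train_idx, test_idx = train_test_split(pdf.index, test_size=0.2, random_state=42)
    out = pdf.copy()
    out["split"] = "train"
    out.loc[test_idx, "split"] = "test"
    return out
```

In Spark this function would be applied per group and the single result filtered back into four pieces, roughly:

```python
# result = df.groupby("key").applyInPandas(split_with_flag, schema="..., split string")
# train_df = result.where("split = 'train'")
# test_df  = result.where("split = 'test'")
```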

2 Answers


Actually, I needed to do subsampling; that is why I wanted the four return values from train_test_split. Instead, I concatenated X_test and y_test and returned a single DataFrame.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(dataset):
    feature_columns = cols
    label = 'y'
    X = dataset[feature_columns]
    Y = dataset[label]

    # split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

    # GROUPED_MAP must return a single pandas DataFrame,
    # so concatenate the test features and labels
    df_sample = pd.concat([X_test, y_test], axis=1)

    return df_sample

This code works for me.
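Since the UDF returns features and label concatenated into one DataFrame, the two pieces can be recovered afterwards by plain column selection. A minimal pandas illustration (the toy column names are assumptions; on the Spark result the same idea works with .select):

```python
import pandas as pd

# toy stand-in for the DataFrame returned by load_dataset:
# feature columns plus the label column 'y', as in the answer
df_sample = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "y": [0, 1]})

feature_columns = ["f1", "f2"]   # plays the role of the question's `cols`
X_test = df_sample[feature_columns]
y_test = df_sample["y"]
```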



What is wrong with:

df = inputDF.cache()
a,b = df.randomSplit([0.5, 0.5])

For time series where order matters, use:

from pyspark.sql import Window
from pyspark.sql.functions import percent_rank

df = df.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("departure_time")))

train_df = df.where("rank <= .8").drop("rank", "departure_time")
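For illustration, the same ordered 80/20 split can be sketched in plain pandas (a toy frame with an assumed "departure_time" column; pandas' percentage rank differs slightly from Spark's percent_rank at the boundaries):

```python
import pandas as pd

# toy time-series frame ordered by departure_time
df = pd.DataFrame({
    "departure_time": range(10),
    "value": list("abcdefghij"),
})

# rank each row by departure_time, scaled into (0, 1]
df["rank"] = df["departure_time"].rank(pct=True)

# earliest 80% for training, latest 20% for testing
train_df = df[df["rank"] <= 0.8].drop(columns=["rank", "departure_time"])
test_df = df[df["rank"] > 0.8].drop(columns=["rank", "departure_time"])
```

Because the split is by rank over an ordering column rather than by random sampling, the test rows are always the most recent ones, which is what you want for time series.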

3 Comments

I want to do the subsampling on the X_test and y_test DataFrames by using sklearn's train_test_split.
Can you offer more insight? Why not generate X_test and y_test as I specified above, and then run the same process again to "subsample"?
I think randomSplit is not great for subsampling; that is why I want to create a pandas UDF for sklearn's train_test_split so that I can use it.
