I find myself trying to modify several dataframes with the same operations again and again. I would like to put all modifications in a function and just call the function with the dataframe name and have all transformations done.
Here is the code with all the transformations I am trying to apply for now. When I run it, nothing happens, and the dataframe remains raw.
#create a preprocessing formula so the process can be applied to any dataset (training and validation and competition)
def preprocessing(df):
    #inspect dataframe
    df.head()
    #check data types in dataframe
    np.unique(df.dtypes).tolist()
    #inspect shape before removing duplicates
    df.shape
    #drop duplicates
    df = df.drop_duplicates()
    #inspect shape again to see change
    df.shape
    #calculate rows that have a mean of 100 to remove them later
    mean100_rows = [i for i in range(len(df)) if df.iloc[i,0:520].values.mean() == 100 ]
    #calculate columns that have a mean of 100 to remove them later
    mean100_cols = [i for i in np.arange(0,520,1) if df.iloc[:,i].values.mean() == 100 ]
    #calculate columns labels that have a mean of 100 to remove them later
    col_labels = [df.columns[i] for i in mean100_cols]
    #delete rows with mean 100
    df.drop(index = mean100_rows, axis=0, inplace=True)
    #delete columns with mean 100
    df.drop(columns=col_labels, axis=1, inplace=True)
    #export columns that have been removed
    pd.Series(col_labels).to_csv('remove_cols.csv')
    #head
    df.head()
    #check size again
    df.shape
You need to add `return df` at the end of the function, then do `df = preprocessing(df)`. DataFrames are mutable, so you can modify them within a function without returning anything. However, I don't recommend that, and many pandas operations return new objects, so that will fail.
You also need `print()` around the lines like `df.shape`, or else you won't see the output. If you're not printing those lines, they aren't doing anything and can be removed.
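To make the advice concrete, here is a minimal sketch of how the function could look once it returns its result and prints the shapes. It is not the only way to write it. The parameters `n_signal_cols`, `fill_value` and `removed_cols_path` are names made up for this example (the question hard-codes 520, 100 and 'remove_cols.csv'), and the toy frame `raw` is invented data. The sketch also uses boolean masks instead of positional row numbers, on the assumption that after `drop_duplicates()` the index labels may no longer match row positions.

```python
import pandas as pd


def preprocessing(df, n_signal_cols=520, fill_value=100, removed_cols_path=None):
    """Return a cleaned copy of df; the caller's DataFrame is not modified."""
    # drop_duplicates returns a new object, so rebind the local name
    df = df.drop_duplicates()
    print("shape after dropping duplicates:", df.shape)

    # Only the first n_signal_cols columns are inspected, as in the question
    signal = df.iloc[:, :n_signal_cols]

    # Boolean masks for rows / columns whose mean equals fill_value
    row_is_flat = signal.mean(axis=1) == fill_value
    col_is_flat = signal.mean(axis=0) == fill_value
    col_labels = signal.columns[col_is_flat.to_numpy()].tolist()

    # Keep the non-flat rows, then drop the flat columns by label
    df = df.loc[~row_is_flat].drop(columns=col_labels)
    print("shape after removing flat rows/columns:", df.shape)

    # Optionally export the removed column labels (hypothetical parameter)
    if removed_cols_path is not None:
        pd.Series(col_labels).to_csv(removed_cols_path)

    return df


# Toy data, invented for illustration: three "signal" columns plus a label
raw = pd.DataFrame({
    "a": [100, 100, 1, 1],
    "b": [100, 100, 100, 100],
    "c": [100, 100, 5, 5],
    "label": [0, 0, 1, 1],
})

# The key step: assign the returned frame to a name
clean = preprocessing(raw, n_signal_cols=3)
```

Because the function returns a new object, `raw` keeps its original rows and columns, and the same call can be reused for each dataset, e.g. `train = preprocessing(train)`.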