1

I find myself trying to modify several dataframes with the same operations again and again. I would like to put all modifications in a function and just call the function with the dataframe name and have all transformations done.

This is the code and all transformations I try to apply for now. When I run it, nothing happens, and the dataframe remains raw.

#create a preprocessing formula so the process can be applied to any dataset (traning and validation and competition)
def preprocessing(df):
    #inspect dataframe
    df.head()

    #check data types in dataframe
    np.unique(df.dtypes).tolist()

    #inspect shape before removing duplicates
    df.shape

    #drop duplicates
    df = df.drop_duplicates()

    #inspect shape again to see change
    df.shape

    #calculate rows that have a mean of 100 to remove them later
    mean100_rows = [i for i in range(len(df)) if df.iloc[i,0:520].values.mean() == 100 ]

    #calculate columns that have a mean of 100 to remove them later
    mean100_cols = [i for i in np.arange(0,520,1) if df.iloc[:,i].values.mean() == 100 ]

    #calculate columns labels that have a mean of 100 to remove them later
    col_labels = [df.columns[i] for i in mean100_cols]

    #delete rows with mean 100
    df.drop(index = mean100_rows, axis=0, inplace=True)

    #delete columns with mean 100
    df.drop(columns=col_labels, axis=1, inplace=True)

    #export columns that have been removed
    pd.Series(col_labels).to_csv('remove_cols.csv')

    #head
    df.head()

    #check size again
    df.shape
3
  • 1
    At the end return df, then do df = preprocessing(df). DataFrames are mutable so you can modify them within a function without returning anything. However, I don't recommend that and many pandas operations return new objects so that will fail. Commented Apr 9, 2019 at 15:33
  • @ALollz Thank you so much, this works wonders! Commented Apr 9, 2019 at 15:37
  • You're also going to need to add print() around the lines like df.shape or else you won't see the output. If you're not printing those lines, they aren't doing anything and can be removed. Commented Apr 9, 2019 at 16:03

1 Answer 1

2

In Python objects passed to functions by reference.

When the following line is executed

df = df.drop_duplicates()

You basically assign new reference to function parameter, but the object outside the function does not change.

I would suggest to change the function so it would return the df object and then assign it's return value to df object outside the function.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.