I am trying to min-max scale a single column in a dataframe.

I am following this: Writing Min-Max scaler function

My code:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

print(df, '\n')

y = df['A'].values


def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]


df['E'] = func(y)
print(df)

The resulting df['E'] is just df['A'] / 100.

I'm not sure what I am missing, but my result looks incorrect.

3 Answers

1

IIUC, are you trying to do something like this?

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
print(df, '\n')


def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]


df_out = df.apply(func).add_prefix('Norm_')
print(df_out)

print(df.join(df_out))

Output:

     A   B   C   D
0   91  59  44   5
1   85  44  57  17
2    6  65  37  46
3   40  50   3  40
4   73  58  47  53
..  ..  ..  ..  ..
95  94  76  22  66
96  70  99  40  59
97  96  84  85  24
98  43  51  59  60
99  31   5  55  89

[100 rows x 4 columns] 

    Norm_A  Norm_B  Norm_C  Norm_D
0     0.93    0.60    0.44    0.05
1     0.87    0.44    0.58    0.17
2     0.06    0.66    0.37    0.47
3     0.41    0.51    0.03    0.41
4     0.74    0.59    0.47    0.54
..     ...     ...     ...     ...
95    0.96    0.77    0.22    0.67
96    0.71    1.00    0.40    0.60
97    0.98    0.85    0.86    0.24
98    0.44    0.52    0.60    0.61
99    0.32    0.05    0.56    0.91

[100 rows x 4 columns]
     A   B   C   D  Norm_A  Norm_B  Norm_C  Norm_D
0   91  59  44   5    0.93    0.60    0.44    0.05
1   85  44  57  17    0.87    0.44    0.58    0.17
2    6  65  37  46    0.06    0.66    0.37    0.47
3   40  50   3  40    0.41    0.51    0.03    0.41
4   73  58  47  53    0.74    0.59    0.47    0.54
..  ..  ..  ..  ..     ...     ...     ...     ...
95  94  76  22  66    0.96    0.77    0.22    0.67
96  70  99  40  59    0.71    1.00    0.40    0.60
97  96  84  85  24    0.98    0.85    0.86    0.24
98  43  51  59  60    0.44    0.52    0.60    0.61
99  31   5  55  89    0.32    0.05    0.56    0.91

[100 rows x 8 columns]
2 Comments

Since you are generating random integers between 0 and 100, the max is most likely going to be close to 100 and the min close to 0, hence you're dividing your values by ~100.
Oh... So my function may be working correctly, but since every column has a max of 100 or maybe 99, the end result is df['A'] / 100 or maybe df['A'] / 99. I was just working through it and I whipped up the toy dataframe, never realizing the values themselves are the reason I am seeing the output I am seeing.
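
To illustrate the point made in these comments, here is a minimal sketch using a made-up column whose values do not span 0 to 99. The values in `narrow` are an assumption chosen for illustration, and `func` is the question's formula with min and max computed once. With this data the min-max result no longer matches a plain division by 100:

```python
import pandas as pd


def func(x):
    # Same formula as in the question; min and max are computed once.
    lo, hi = min(x), max(x)
    return [round((i - lo) / (hi - lo), 2) for i in x]


# Hypothetical column: min is 50 and max is 150, not roughly 0 and 100.
narrow = pd.Series([50, 60, 75, 100, 150])

scaled = func(narrow.tolist())                         # min-max scaled
divided = [round(v / 100, 2) for v in narrow.tolist()] # plain division

print(scaled)   # [0.0, 0.1, 0.25, 0.5, 1.0]
print(divided)  # [0.5, 0.6, 0.75, 1.0, 1.5]
```

The two lists only coincide when the column's minimum is 0 and its range is 100, which is nearly the case for the random toy data in the question.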
1

Also consider that using apply() with a function is typically quite inefficient. Try to use vectorized operations whenever you can...

This is a more efficient expression to normalize each column according to the minimum and maximum for that column:

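# Use names that do not shadow the built-in min() and max(),
# which func() above still relies on.
col_min = df.min()  # per column
col_max = df.max()  # per column
df.join(np.round((df - col_min) / (col_max - col_min), 2).add_prefix('Norm_'))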

That's much faster than using apply() with a function. For your sample DataFrame:

%timeit df.join(np.round((df - df.min()) / (df.max() - df.min()), 2).add_prefix('Norm_'))
9.89 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

While the version with apply takes about 4.6x longer:

%timeit df.join(df.apply(func).add_prefix('Norm_'))
45.8 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But this difference grows quickly with the size of the DataFrame. For example, with a DataFrame of size 1,000 x 26, I get 37.2 ms ± 269 µs for the vectorized version versus 19.5 s ± 1.82 s for the version using apply, which makes the vectorized version around 500x faster.
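
Since the question is about a single column, the same vectorized idea might be applied to one Series, roughly as follows. This is only a sketch: the helper name `min_max_scale`, its `decimals` parameter, and the choice to return zeros for a constant column are assumptions, not anything from the question.

```python
import pandas as pd


def min_max_scale(col: pd.Series, decimals: int = 2) -> pd.Series:
    """Hypothetical helper: scale one column to [0, 1] without a Python loop."""
    col_min, col_max = col.min(), col.max()
    span = col_max - col_min
    if span == 0:
        # Assumption: map a constant column to zeros instead of dividing by zero.
        return pd.Series(0.0, index=col.index)
    return ((col - col_min) / span).round(decimals)


# Small made-up DataFrame for illustration.
df = pd.DataFrame({'A': [10, 20, 40, 50]})
df['E'] = min_max_scale(df['A'])
print(df)
```

Computing the minimum and maximum once and letting pandas do the arithmetic on the whole column avoids calling min() and max() again for every element, which is what the list comprehension in func does.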

0

Not sure what you are after. Your maximum and minimum are nearly known in advance because of the number range.

# Pass a list (not a set) so the order of the result rows is well defined.
df.loc[:, 'A':'D'].agg(['min', 'max'])

And if all you need is for df['E'] to be df['A'] / 100, why not:

df['E'] = df['A'] / 100
y = df['E'].values
y

Please don't mark me down; I'm just trying to get some clarity.

1 Comment

I am getting df['E'] = df['A'] / 100, but that is not what I am after. Not sure why my function is generating that output.
