Function against pandas column not generating expected output

Question

I am trying to min-max scale a single column in a dataframe.

I am following this: Writing Min-Max scaler function

My code:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

print(df, '\n')

y = df['A'].values


def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]


df['E'] = func(y)
print(df)

df['E'] is just df['A'] / 100.

Not sure what I am missing, but my result is incorrect.

Scott Boston · Accepted Answer · 2020-02-22 02:15:52Z

1

IIUC, are you trying to do something like this?

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
print(df, '\n')


def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]


df_out = df.apply(func).add_prefix('Norm_')
print(df_out)

print(df.join(df_out))

Output:

     A   B   C   D
0   91  59  44   5
1   85  44  57  17
2    6  65  37  46
3   40  50   3  40
4   73  58  47  53
..  ..  ..  ..  ..
95  94  76  22  66
96  70  99  40  59
97  96  84  85  24
98  43  51  59  60
99  31   5  55  89

[100 rows x 4 columns] 

    Norm_A  Norm_B  Norm_C  Norm_D
0     0.93    0.60    0.44    0.05
1     0.87    0.44    0.58    0.17
2     0.06    0.66    0.37    0.47
3     0.41    0.51    0.03    0.41
4     0.74    0.59    0.47    0.54
..     ...     ...     ...     ...
95    0.96    0.77    0.22    0.67
96    0.71    1.00    0.40    0.60
97    0.98    0.85    0.86    0.24
98    0.44    0.52    0.60    0.61
99    0.32    0.05    0.56    0.91

[100 rows x 4 columns]
     A   B   C   D  Norm_A  Norm_B  Norm_C  Norm_D
0   91  59  44   5    0.93    0.60    0.44    0.05
1   85  44  57  17    0.87    0.44    0.58    0.17
2    6  65  37  46    0.06    0.66    0.37    0.47
3   40  50   3  40    0.41    0.51    0.03    0.41
4   73  58  47  53    0.74    0.59    0.47    0.54
..  ..  ..  ..  ..     ...     ...     ...     ...
95  94  76  22  66    0.96    0.77    0.22    0.67
96  70  99  40  59    0.71    1.00    0.40    0.60
97  96  84  85  24    0.98    0.85    0.86    0.24
98  43  51  59  60    0.44    0.52    0.60    0.61
99  31   5  55  89    0.32    0.05    0.56    0.91

[100 rows x 8 columns]
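One thing worth noting about func as written: min(x) and max(x) sit inside the list comprehension, so both are recomputed for every element and the work grows roughly quadratically with the column length. A minimal sketch of a variant that computes the bounds once is below. The name func_hoisted and the fixed seed are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def func(x):
    # Original version: min(x) and max(x) are re-evaluated for every element.
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]


def func_hoisted(x):
    # Hypothetical variant: compute the bounds once, then reuse them.
    lo, hi = min(x), max(x)
    span = hi - lo
    return [round((i - lo) / span, 2) for i in x]


rng = np.random.default_rng(0)  # fixed seed, assumed here for repeatability
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))

# Both versions should give the same scaled values for a column.
df['E'] = func_hoisted(df['A'])
same = func_hoisted(df['A']) == func(df['A'])
print(same)
print(df.head())
```

The output values are unchanged; only the number of passes over the column differs.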


2 Comments

Scott Boston Over a year ago

Since you are generating random integers between 0 and 100, the max is most likely close to 100 and the min close to 0, hence you're dividing your values by ~100.

MarkS Over a year ago

Oh... So my function may be working correctly, but since every column has a max of 100 or maybe 99, the end result is df['A'] / 100 or maybe df['A'] / 99. I was just working through it and I whipped up the toy dataframe, never realizing the values themselves are the reason I am seeing the output I am seeing.
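To make the point in these comments concrete, here is a small sketch using a column whose values do not start near 0. The sample values are made up for illustration. With this data, the min-max result and a plain division by 100 no longer coincide.

```python
import pandas as pd


def func(x):
    return [round((i - min(x)) / (max(x) - min(x)), 2) for i in x]


# Hypothetical column: min is 50 and max is 150, so the range is still 100
# but the values are shifted away from 0.
s = pd.Series([50, 75, 100, 125, 150], name='A')

scaled = [float(v) for v in func(s.values)]  # min-max scaled
naive = (s / 100).round(2).tolist()          # plain division by 100

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(naive)   # [0.5, 0.75, 1.0, 1.25, 1.5]
```

The function subtracts the minimum before dividing by the range, which only looks like "divide by 100" when the minimum is close to 0 and the maximum is close to 100.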

filbranden · Accepted Answer · 2020-02-22 05:36:35Z

Also consider that using apply() with a function is typically quite inefficient. Try to use vectorized operations whenever you can...

This is a more efficient expression to normalize each column according to the minimum and maximum for that column:

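# Named col_min/col_max so the built-in min() and max() used by func are not shadowed.
col_min = df.min()  # per column
col_max = df.max()  # per column
df.join(np.round((df - col_min) / (col_max - col_min), 2).add_prefix('Norm_'))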

That's much faster than using apply() on a function. For your sample DataFrame:

%timeit df.join(np.round((df - df.min()) / (df.max() - df.min()), 2).add_prefix('Norm_'))
9.89 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

While the version with apply takes about 4.6x longer:

%timeit df.join(df.apply(func).add_prefix('Norm_'))
45.8 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But this difference grows quickly with the size of the DataFrame. For example, with a DataFrame of size 1,000 x 26, I get 37.2 ms ± 269 µs for the version using vectorized operations, versus 19.5 s ± 1.82 s for the version using apply, so the vectorized version is around 500x faster!
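Since the question asks about a single column, the same vectorized idea can be sketched for one Series. The helper name min_max_scale, its decimals parameter, and the choice to return zeros for a constant column are all assumptions made for illustration, not something from the answers here.

```python
import pandas as pd


def min_max_scale(s: pd.Series, decimals: int = 2) -> pd.Series:
    """Hypothetical helper: min-max scale one column with vectorized ops."""
    lo, hi = s.min(), s.max()
    span = hi - lo
    if span == 0:
        # Constant column: the formula would divide by zero.
        # Returning zeros is one possible convention (an assumption).
        return pd.Series(0.0, index=s.index, name=s.name)
    return ((s - lo) / span).round(decimals)


df = pd.DataFrame({'A': [10, 20, 40, 50]})
df['E'] = min_max_scale(df['A'])
print(df['E'].tolist())  # [0.0, 0.25, 0.75, 1.0]
```

Computing the minimum and span once and letting pandas do the arithmetic on the whole Series avoids the per-element Python loop entirely.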

wwnde · Accepted Answer · 2020-02-22 02:24:37Z

0

Not sure what you are after. Your max and min are nearly known in advance because of the number range. You can check them per column with:

df.loc[:, 'A':'D'].agg(['min', 'max'])

And if all you need is df['E'] = df['A'] / 100, why not:

df['E'] = df['A'] / 100
y = df['E'].values
y

Please don't mark me down; I'm just trying to get some clarity.


1 Comment

MarkS Over a year ago

I am getting df['E'] = df['A'] / 100, but that is not what I am after. Not sure why my function is generating that output.
