
I have a numpy array where each cell of a given row holds the value of a feature. I store all of them in a 100*4 matrix.

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09  

Any idea how I can normalize the rows of this numpy array so that each value ends up between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18 (which is 0.09/0.5)
  • Just to be clear: is it a NumPy array or a Pandas DataFrame?
  • When programming it's important to be specific: a set is a particular object in Python, and you can't have a set of numpy arrays. Python doesn't have a matrix, but numpy does, and that matrix type isn't the same as a numpy array/ndarray (which is itself different from Python's array type, which in turn is not the same as a list). And none of these are pandas DataFrames. (A short type-check sketch follows these comments.)
  • I do not think this is a complete normalization. I would look at stackoverflow.com/questions/9775765/… for a better definition of normalization.
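To illustrate the distinction drawn in that comment, a quick (purely illustrative) type check:

import array
import numpy as np

print(type([1, 2, 3]))                      # <class 'list'>
print(type(array.array('d', [1, 2, 3])))    # <class 'array.array'>
print(type(np.array([1, 2, 3])))            # <class 'numpy.ndarray'>
print(type(np.matrix([[1, 2, 3]])))         # <class 'numpy.matrix'> (generally discouraged)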

2 Answers


If I understand correctly, what you want to do is divide by the maximum value in each column. You can do this easily using broadcasting.

Starting with your example array:

import numpy as np

x = np.array([[1000,  10,   0.5],
              [ 765,   5,  0.35],
              [ 800,   7,  0.09]])

x_normed = x / x.max(axis=0)

print(x_normed)
# [[ 1.     1.     1.   ]
#  [ 0.765  0.5    0.7  ]
#  [ 0.8    0.7    0.18 ]]

x.max(0) takes the maximum over the 0th dimension (i.e. rows). This gives you a vector of size (ncols,) containing the maximum value in each column. You can then divide x by this vector in order to normalize your values such that the maximum value in each column will be scaled to 1.
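
To make the intermediate step concrete, here is a small check of the column-maximum vector that gets broadcast (continuing from the x defined above):

col_max = x.max(axis=0)   # same as x.max(0): the maximum of each column
print(col_max)            # the three column maxima: 1000., 10., 0.5
print(col_max.shape)      # (3,) -- one value per column

# Broadcasting stretches this length-3 vector across every row of x,
# so each column is divided by its own maximum.
print(x / col_max)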


If x contains negative values you would need to subtract the minimum first:

x_normed = (x - x.min(0)) / x.ptp(0)

Here, x.ptp(0) returns the "peak-to-peak" (i.e. the range, max - min) along axis 0. This normalization also guarantees that the minimum value in each column will be 0.
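
As a quick check with negative values (an illustrative array, not from the question), the same one-liner maps every column onto [0, 1]; np.ptp(y, axis=0) is simply the function form of y.ptp(0):

y = np.array([[-2.0, 10.0],
              [ 0.0, 20.0],
              [ 2.0, 30.0]])

y_normed = (y - y.min(0)) / np.ptp(y, axis=0)  # (y - column min) / (column range)

print(y_normed)
# [[ 0.   0. ]
#  [ 0.5  0.5]
#  [ 1.   1. ]]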


7 Comments

I really appreciate your answer; I always have issues dealing with "axis"!
For reductions (i.e. .max(), .min(), .sum(), .mean(), etc.), you just need to remember that axis specifies the dimension that you want to "collapse" during the reduction. If you want the maximum for each column, then you need to collapse the row dimension.
@rawbeans See my update. The reason I divided by the maximum is because that's what the OP showed in their example.
@ali_m, Would you please explain why you are saying "If x contains negative values"? If the minimum of the array is 100 and the maximum is 103, I think you should definitely use your second formula, otherwise your result will not have a 0 offset.
@GalacticKetchup You can easily extend this to reductions over arbitrary axes by passing keepdims=True to the reduction ufunc. This arg prevents the reduction axis from getting "squeezed out" so that broadcasting will still work correctly, e.g. x / x.max(axis=1, keepdims=True).
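
Following up on the keepdims comment above, a minimal sketch of the row-wise version (each row's maximum scales to 1), reusing the x from the answer:

# keepdims=True keeps the reduced axis as size 1, so the row maxima have
# shape (nrows, 1) and broadcast across the columns of x.
row_normed = x / x.max(axis=1, keepdims=True)

print(row_normed)  # row 0 becomes [1, 0.01, 0.0005], and so on (values rounded)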

You can use sklearn.preprocessing:

import numpy as np
from sklearn.preprocessing import normalize

data = np.array([
    [1000, 10, 0.5],
    [765, 5, 0.35],
    [800, 7, 0.09],
])
data = normalize(data, axis=0, norm='max')

print(data)
# [[ 1.     1.     1.   ]
#  [ 0.765  0.5    0.7  ]
#  [ 0.8    0.7    0.18 ]]

1 Comment

Any way to scale the column values between 1 and 2? Using MinMaxScaler?
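
One way to do that, sketched here on the array from this answer and assuming scikit-learn's MinMaxScaler with its feature_range parameter (which rescales each column into the given interval):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([
    [1000, 10, 0.5],
    [765, 5, 0.35],
    [800, 7, 0.09],
])

# feature_range=(1, 2) maps each column's minimum to 1 and its maximum to 2
scaler = MinMaxScaler(feature_range=(1, 2))
scaled = scaler.fit_transform(data)

print(scaled)  # e.g. the first column becomes [2., 1., 1.1489...]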
