
I am learning Python and, in particular, how to use vectorized operations in Pandas. However, when I try to normalize a Pandas DataFrame using vectorized operations, I get error messages.

This reproducible example uses the surveys.csv dataset, which can be found at this link: http://www.datacarpentry.org/python-ecology-lesson/setup/

    import pandas as pd

    surveys_df = pd.read_csv("surveys.csv")

    # Returns NaNs
    surveys_df_normalized = (surveys_df["weight"] - surveys_df["weight"].mean()) / surveys_df["weight"].std()

    # Returns error
    surveys_df_normalized = (surveys_df - surveys_df.mean()) / surveys_df.std()

Your advice will be appreciated.

The error message is the following:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\lib\site-packages\pandas\core\ops.py in na_op(x, y)
   1175             result = expressions.evaluate(op, str_rep, x, y,
-> 1176                                           raise_on_error=True, **eval_kwargs)
   1177         except TypeError:

~\lib\site-packages\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    210         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 211                          **eval_kwargs)
    212     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)

~\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
    121     if result is None:
--> 122         result = _evaluate_standard(op, op_str, a, b, raise_on_error)
    123 

~\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

TypeError: unsupported operand type(s) for -: 'str' and 'float'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-88-43990c071f94> in <module>()
----> 1 surveys_df_normalized = (surveys_df - surveys_df.mean())/surveys_df.std()

~\lib\site-packages\pandas\core\ops.py in f(self, other, axis, level, fill_value)
   1234             return self._combine_frame(other, na_op, fill_value, level)
   1235         elif isinstance(other, ABCSeries):
-> 1236             return self._combine_series(other, na_op, fill_value, axis, level)
   1237         else:
   1238             if fill_value is not None:

~\lib\site-packages\pandas\core\frame.py in _combine_series(self, other, func, fill_value, axis, level)
   3504                                                    fill_value=fill_value)
   3505         return self._combine_series_infer(other, func, level=level,
-> 3506                                           fill_value=fill_value)
   3507 
   3508     def _combine_series_infer(self, other, func, level=None, fill_value=None):

~\lib\site-packages\pandas\core\frame.py in _combine_series_infer(self, other, func, level, fill_value)
   3516 
   3517         return self._combine_match_columns(other, func, level=level,
-> 3518                                            fill_value=fill_value)
   3519 
   3520     def _combine_match_index(self, other, func, level=None, fill_value=None):

~\lib\site-packages\pandas\core\frame.py in _combine_match_columns(self, other, func, level, fill_value)
   3536 
   3537         new_data = left._data.eval(func=func, other=right,
-> 3538                                    axes=[left.columns, self.index])
   3539         return self._constructor(new_data)
   3540 

~\lib\site-packages\pandas\core\internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

~\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

~\lib\site-packages\pandas\core\internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1182         try:
   1183             with np.errstate(all='ignore'):
-> 1184                 result = get_result(other)
   1185 
   1186         # if we have an invalid shape/broadcast error

~\lib\site-packages\pandas\core\internals.py in get_result(other)
   1151 
   1152             else:
-> 1153                 result = func(values, other)
   1154 
   1155             # mask if needed

~\lib\site-packages\pandas\core\ops.py in na_op(x, y)
   1181                 result = np.empty(x.size, dtype=dtype)
   1182                 yrav = y.ravel()
-> 1183                 mask = notnull(xrav) & notnull(yrav)
   1184                 xrav = xrav[mask]
   1185 

ValueError: operands could not be broadcast together with shapes (71098,) (2,)
  • What's the error message? Commented Nov 25, 2017 at 9:13

1 Answer


This happens because the weight column contains null values, so it helps to understand the dataset first. To normalize weight, you need to remove the rows with null values. Make a copy of the data without them as test_data and perform the operations on that.

surveys_df = pd.read_csv("surveys.csv")
test_data = surveys_df.dropna()

Dropping all null values is not good practice in general, but it is fine for experimenting. Now check whether test_data has any null values.

test_data.isnull().any()

If all are False, then perform your normalization.

surveys_df_normalized = (test_data["weight"] - test_data["weight"].mean())/test_data["weight"].std()
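
For reference, mean() and std() skip NaN by default (skipna=True), so in the single-column version the NaNs in the result should appear only in rows where weight itself is missing. The sketch below illustrates this on a small made-up Series standing in for the weight column; the values are assumptions for illustration, not data from surveys.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for surveys_df["weight"]; the values are made up.
weight = pd.Series([40.0, np.nan, 48.0, 44.0], name="weight")

# mean() and std() ignore NaN by default, so both statistics are computed
# from the three observed values only.
normalized = (weight - weight.mean()) / weight.std()

# NaN shows up only at the position where the original weight was missing.
print(normalized)
```

Dropping the rows with dropna() first, as above, gives the same values for the remaining rows; it just removes the NaN entries from the result.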

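Note that the last line of your code still raises an error: surveys_df contains string columns as well as numeric ones, and subtracting a float from a string is undefined (hence the TypeError: unsupported operand type(s) for -: 'str' and 'float'). Restrict the operation to numeric columns such as surveys_df["weight"]. I hope this helps.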
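
One possible way to keep the whole-frame, vectorized style is to select the numeric columns first and normalize only those. This is a minimal sketch, not the only approach: select_dtypes is a standard pandas method, but the small DataFrame here is a hypothetical miniature of surveys.csv with made-up values, and the column names are assumed to match the dataset.

```python
import pandas as pd

# Hypothetical miniature of surveys.csv: column names are assumed to follow
# the dataset, values are made up for illustration.
surveys_df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF", "DM"],
    "sex": ["M", "M", "F", "F"],
    "hindfoot_length": [32.0, 36.0, 34.0, 38.0],
    "weight": [40.0, 48.0, 44.0, 52.0],
})

# Keep only the numeric columns; string columns have no meaningful mean or std.
numeric_df = surveys_df.select_dtypes(include="number")

# Column-wise z-score: each column's own mean and std are aligned by column
# label and applied to every row.
surveys_df_normalized = (numeric_df - numeric_df.mean()) / numeric_df.std()
print(surveys_df_normalized)
```

With the real dataset, the same two lines after read_csv should apply, although integer identifier columns would be normalized too unless they are excluded explicitly.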


2 Comments

Thank you, you are right about the NaNs. But why does the line surveys_df_normalized = (surveys_df - surveys_df.mean())/surveys_df.std() return an error? I am not referring to any particular column in this line; I am trying to perform the operation on the entire dataframe at once, i.e. in a vectorized manner.
@rf7 surveys_df contains both string and float values, and mean() and std() don't make sense for strings. So you need to stick to the numeric columns.
