Vectorized operations in Pandas - Python

Question

I am learning Python and, in particular, trying to use vectorized operations in Pandas. However, when I try to normalize a Pandas dataframe using vectorized operations, I get error messages.

This reproducible example uses the surveys.csv dataset, which can be found at this link: http://www.datacarpentry.org/python-ecology-lesson/setup/

    import pandas as pd

    surveys_df = pd.read_csv("surveys.csv")

    surveys_df_normalized = (surveys_df["weight"] - surveys_df["weight"].mean()) / surveys_df["weight"].std()  # Returns NaNs

    surveys_df_normalized = (surveys_df - surveys_df.mean()) / surveys_df.std()  # Returns error

Your advice will be appreciated.

The error message is the following:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\lib\site-packages\pandas\core\ops.py in na_op(x, y)
   1175             result = expressions.evaluate(op, str_rep, x, y,
-> 1176                                           raise_on_error=True, **eval_kwargs)
   1177         except TypeError:

~\lib\site-packages\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
    210         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
--> 211                          **eval_kwargs)
    212     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)

~\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
    121     if result is None:
--> 122         result = _evaluate_standard(op, op_str, a, b, raise_on_error)
    123 

~\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 

TypeError: unsupported operand type(s) for -: 'str' and 'float'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-88-43990c071f94> in <module>()
----> 1 surveys_df_normalized = (surveys_df - surveys_df.mean())/surveys_df.std()

~\lib\site-packages\pandas\core\ops.py in f(self, other, axis, level, fill_value)
   1234             return self._combine_frame(other, na_op, fill_value, level)
   1235         elif isinstance(other, ABCSeries):
-> 1236             return self._combine_series(other, na_op, fill_value, axis, level)
   1237         else:
   1238             if fill_value is not None:

~\lib\site-packages\pandas\core\frame.py in _combine_series(self, other, func, fill_value, axis, level)
   3504                                                    fill_value=fill_value)
   3505         return self._combine_series_infer(other, func, level=level,
-> 3506                                           fill_value=fill_value)
   3507 
   3508     def _combine_series_infer(self, other, func, level=None, fill_value=None):

~\lib\site-packages\pandas\core\frame.py in _combine_series_infer(self, other, func, level, fill_value)
   3516 
   3517         return self._combine_match_columns(other, func, level=level,
-> 3518                                            fill_value=fill_value)
   3519 
   3520     def _combine_match_index(self, other, func, level=None, fill_value=None):

~\lib\site-packages\pandas\core\frame.py in _combine_match_columns(self, other, func, level, fill_value)
   3536 
   3537         new_data = left._data.eval(func=func, other=right,
-> 3538                                    axes=[left.columns, self.index])
   3539         return self._constructor(new_data)
   3540 

~\lib\site-packages\pandas\core\internals.py in eval(self, **kwargs)
   3195 
   3196     def eval(self, **kwargs):
-> 3197         return self.apply('eval', **kwargs)
   3198 
   3199     def quantile(self, **kwargs):

~\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

~\lib\site-packages\pandas\core\internals.py in eval(self, func, other, raise_on_error, try_cast, mgr)
   1182         try:
   1183             with np.errstate(all='ignore'):
-> 1184                 result = get_result(other)
   1185 
   1186         # if we have an invalid shape/broadcast error

~\lib\site-packages\pandas\core\internals.py in get_result(other)
   1151 
   1152             else:
-> 1153                 result = func(values, other)
   1154 
   1155             # mask if needed

~\lib\site-packages\pandas\core\ops.py in na_op(x, y)
   1181                 result = np.empty(x.size, dtype=dtype)
   1182                 yrav = y.ravel()
-> 1183                 mask = notnull(xrav) & notnull(yrav)
   1184                 xrav = xrav[mask]
   1185 

ValueError: operands could not be broadcast together with shapes (71098,) (2,)

What's the error message?

– user1767754, Commented Nov 25, 2017 at 9:13

Hari_pb · Accepted Answer · 2017-11-25 09:29:33Z


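This happens because the weight column contains null values, so you need to look at the dataset first. mean() and std() skip nulls by default, but every row where weight is null still comes out as NaN in the result. Remove the rows with null values before normalizing: make a slice of the data called test_data and perform the operations on it.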

import pandas as pd

surveys_df = pd.read_csv("surveys.csv")
test_data = surveys_df.dropna()  # drops every row that contains a null value

Dropping all null values is not good practice, but it is fine for experimenting. Now check whether test_data still has any null values.

test_data.isnull().any()

If all values are False, perform your normalization.

surveys_df_normalized = (test_data["weight"] - test_data["weight"].mean())/test_data["weight"].std()
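If dropping every row with any null feels too aggressive, one option is to drop only the rows where weight itself is missing. The sketch below uses a tiny made-up dataframe as a stand-in for surveys.csv (the values are invented; only the column names mirror the dataset), so treat it as an illustration rather than the exact result you will see.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for surveys.csv (assumption: same column names, tiny sample)
df = pd.DataFrame({
    "species_id": ["NL", "DM", None, "DM"],
    "weight": [40.0, np.nan, 48.0, 44.0],
})

# mean() and std() skip NaN by default, so only the rows with a missing
# weight come out as NaN in the normalized result.
normalized = (df["weight"] - df["weight"].mean()) / df["weight"].std()

# Drop only the rows where weight is missing; rows with nulls in other
# columns are kept.
weight_only = df.dropna(subset=["weight"])
normalized_clean = (
    weight_only["weight"] - weight_only["weight"].mean()
) / weight_only["weight"].std()
```

In this sample, df.dropna() would keep two rows, while dropna(subset=["weight"]) keeps three.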

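Note that your last line of code cannot work as written: it applies the arithmetic to the whole dataframe surveys_df rather than to a single column such as ['weight'], and surveys_df also contains string columns. That is what the TypeError in your traceback reports (unsupported operand type(s) for -: 'str' and 'float'). I hope this helps.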
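If the goal is to normalize several columns at once, one possible approach is to select the numeric columns first and apply the same vectorized expression to that subset. This is a minimal sketch on made-up data (the dataframe below is an invented stand-in for surveys.csv, and the column names are assumptions borrowed from the dataset), not the only way to do it.

```python
import pandas as pd

# Made-up stand-in for surveys.csv (assumption: a mix of string and numeric columns)
df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF", "DM"],
    "sex": ["M", "F", "M", "F"],
    "hindfoot_length": [32.0, 36.0, 34.0, 38.0],
    "weight": [40.0, 48.0, 44.0, 52.0],
})

# Keep only the numeric columns; strings have no mean or standard deviation.
numeric = df.select_dtypes(include="number")

# Column-wise z-score: the Series returned by mean() and std() are aligned
# with the dataframe on column labels, so each column uses its own statistics.
normalized = (numeric - numeric.mean()) / numeric.std()
```

Each column of normalized then has mean 0 and standard deviation 1, and the string columns are simply left out of the result.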


2 Comments

rf7 Over a year ago

Thank you, you are right about the NaNs. But why does the line surveys_df_normalized = (surveys_df - surveys_df.mean())/surveys_df.std() return an error? I am not referring to any particular column in this line; I am trying to apply the operation to the entire dataframe at once, i.e. in a vectorized manner.

Hari_pb Over a year ago

@rf7 surveys_df contains both string and float values, and mean() and std() don't make sense for strings. So you need to stick to the numeric columns.
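To see which columns are the problem, it can help to list the non-numeric columns and to request the statistics for numeric columns explicitly. A small sketch, again on invented data standing in for surveys.csv; how mean() treats string columns by default has differed between pandas versions, which is why numeric_only=True is spelled out here.

```python
import pandas as pd

# Made-up stand-in for surveys.csv (assumption: same kind of mixed columns)
df = pd.DataFrame({
    "sex": ["M", "F", "M"],
    "weight": [40.0, 48.0, 44.0],
})

# Columns that are not numeric, i.e. the ones that break the arithmetic.
text_cols = df.select_dtypes(exclude="number").columns.tolist()

# numeric_only=True restricts the statistics to the numeric columns
# instead of relying on version-dependent default behaviour.
means = df.mean(numeric_only=True)
stds = df.std(numeric_only=True)
```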
