I have a fairly large dataset (over 6 million rows, with just a few columns). When I try to add two float64 columns (data['C'] = data.A + data.B), I get a memory error:
Traceback (most recent call last):
  File "01_processData.py", line 354, in <module>
    prepareData(snp)
  File "01_processData.py", line 161, in prepareData
    data['C'] = data.A + data.C
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 480, in wrapper
    return_indexers=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/index.py", line 976, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1304, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1345, in _join_non_unique
    how=how, sort=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 465, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError
I understand that this operation uses the index to align the two operands, but that seems inefficient here: both columns belong to the same DataFrame, so they are already perfectly aligned.
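The `_join_non_unique` frame in the traceback suggests that the index contains repeated labels, which seems to be the case where alignment falls back to a join. A minimal sketch of how that could be checked, using a tiny made-up frame (the values and the dates are purely illustrative, not my real data):

```python
import pandas as pd

# Hypothetical stand-in for the real data: a DatetimeIndex with repeated labels.
idx = pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-02", "2014-01-02"])
data = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=idx,
)

# is_unique is False as soon as any index label occurs more than once.
index_is_unique = data.index.is_unique

# duplicated() flags every occurrence of a label after its first one.
n_duplicates = int(data.index.duplicated().sum())

print(index_is_unique, n_duplicates)
```

If the index turns out to be non-unique, that would at least explain why a join is attempted at all.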
I was able to solve the problem by using
data['C'] = data.A.values + data.B.values
but I wonder whether there is a method designed for this, or a more elegant solution?
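For reference, here is a minimal sketch of that workaround on a toy frame, next to one alternative I considered (dropping the index before adding). The frame, its values, and the names `tmp` and `c_via_reset` are made up for illustration; I have not confirmed that the alternative avoids the memory problem on the full dataset.

```python
import pandas as pd

# Hypothetical toy frame with a non-unique DatetimeIndex.
idx = pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-02", "2014-01-02"])
data = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=idx,
)

# Workaround: .values exposes the underlying NumPy arrays, so the addition is
# purely positional and no index alignment takes place.
data["C"] = data["A"].values + data["B"].values

# Alternative: replace the index with a default integer index, add the columns
# as Series, then take the result back as an array.
tmp = data.reset_index(drop=True)
c_via_reset = (tmp["A"] + tmp["B"]).values

print(data["C"].tolist())
print(c_via_reset.tolist())
```

Both variants rely on the rows already being in matching order, which holds here because the columns come from the same DataFrame.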
Can you post the output of df.info()?