How to add two columns efficiently in Pandas DataFrame?

Question

I have a fairly large dataset (over 6 million rows, with just a few columns). When I try to add two float64 columns (data['C'] = data.A + data.B), I get a memory error:

Traceback (most recent call last):
  File "01_processData.py", line 354, in <module>
    prepareData(snp)
  File "01_processData.py", line 161, in prepareData
    data['C'] = data.A + data.C
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 480, in wrapper
    return_indexers=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/index.py", line 976, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1304, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1345, in _join_non_unique
    how=how, sort=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 465, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError

I understand that this operation uses the index to align the operands before computing the output, but that seems inefficient here: two columns that belong to the same DataFrame are already perfectly aligned.
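To illustrate the alignment step mentioned above, here is a minimal sketch using two toy Series (the names left, right, total and same are made up for this example). It only shows how label alignment can enlarge a result when the indexes differ and contain duplicates; whether that is what happened in the traceback above is not confirmed.

```python
import pandas as pd

# Two Series whose indexes differ and both contain the duplicate label "a".
left = pd.Series([1.0, 2.0], index=["a", "a"])
right = pd.Series([10.0, 20.0, 30.0], index=["a", "a", "b"])

# Addition aligns on labels first. Each "a" on the left pairs with each "a"
# on the right, so the result has more rows than either input.
total = left + right
print(len(left), len(right), len(total))

# When both operands carry an identical index, no expansion occurs.
same = left + left
print(len(same))
```

With millions of rows and many repeated labels, an expansion of this kind could plausibly exhaust memory, which is why bypassing alignment helps.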

I was able to solve the problem by using

data['C'] = data.A.values + data.B.values

but I wonder whether there is a method designed for this, or a more elegant solution.
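For reference, the workaround above can be sketched on a small toy frame as follows. The data here is invented for illustration, and .to_numpy() is used as the newer spelling of .values (available in pandas 0.24 and later); the idea is the same: add the underlying arrays positionally so no index alignment takes place.

```python
import pandas as pd

# Toy frame with a non-unique index, standing in for the real dataset.
data = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [0.5, 0.25, 0.125]},
    index=["x", "x", "y"],
)

# Add the raw NumPy arrays; the result is assigned back row by row,
# in order, without consulting the index labels.
data["C"] = data["A"].to_numpy() + data["B"].to_numpy()
print(data)
```

This is only safe when both columns come from the same frame (or are otherwise known to be in the same row order), since the labels are ignored.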

Which pandas version and OS are you using, and is it 32- or 64-bit? How much memory do you have, and can you show df.info()?
– Jeff, Commented May 15, 2014 at 10:26

Jeff · Accepted Answer · 2014-05-15 12:59:45Z

I cannot reproduce this. Because both columns share the same index, the addition should not hit the alignment path at all.

In master/0.14 (releasing shortly):

In [1]: import numpy as np; import pandas as pd; from pandas import DataFrame

In [2]: df = DataFrame(np.random.randn(6000000,2),columns=['A','C'],index=pd.MultiIndex.from_product([['foo','bar'],range(3000000)]))

In [3]: df.values.nbytes
Out[3]: 96000000

In [4]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 625.839844 MB per loop

However, in 0.13.1 the same operation peaks much higher (I recall that some optimizations went into 0.14):

In [3]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 1113.671875 MB per loop
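If %memit is not available, a rough comparison can be sketched with the standard-library tracemalloc module. This is an assumption-laden stand-in, not the measurement used above: tracemalloc reports allocations it can trace rather than process memory, so the numbers are not comparable to the %memit figures. The helper name peak_bytes and the reduced row count are made up for this example.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_bytes(func):
    """Return the peak number of bytes tracemalloc traced while func() ran."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Smaller than the 6,000,000 rows above so the sketch runs quickly.
n = 100_000
index = pd.MultiIndex.from_product([["foo", "bar"], range(n // 2)])
df = pd.DataFrame(np.random.randn(n, 2), columns=["A", "C"], index=index)

# Label-aware addition versus addition of the raw arrays.
aligned_peak = peak_bytes(lambda: df["A"] + df["C"])
raw_peak = peak_bytes(lambda: df["A"].to_numpy() + df["C"].to_numpy())
print(aligned_peak, raw_peak)
```

The absolute values depend on the pandas and NumPy versions installed, so treat them as a relative indication only.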

answered May 15, 2014 at 12:06 by Jeff; edited May 15, 2014 at 12:59

FooBar · Answer · 2017-05-23 12:21:25Z

Do you have a hierarchical index set? My Python used to crash in that situation, and calling reset_index() before summing used to help. However, others could not reproduce this, so it is not a "guaranteed improvement".

See my post on this
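A minimal sketch of the reset_index() approach described in this answer, on invented toy data (the level names group and row are hypothetical). As noted, this is not confirmed to help in general; it simply shows the mechanics of flattening the hierarchical index, adding the columns, and restoring the index.

```python
import numpy as np
import pandas as pd

# Toy frame with a two-level (hierarchical) index.
index = pd.MultiIndex.from_product(
    [["foo", "bar"], range(3)], names=["group", "row"]
)
df = pd.DataFrame(
    np.arange(12, dtype="float64").reshape(6, 2),
    columns=["A", "B"],
    index=index,
)

# Move the index levels into ordinary columns, leaving a default RangeIndex.
flat = df.reset_index()

# Add the two columns on the flat frame.
flat["C"] = flat["A"] + flat["B"]

# Restore the original hierarchical index.
result = flat.set_index(["group", "row"])
print(result)
```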

answered May 15, 2014 at 7:38 by FooBar; edited May 23, 2017 at 12:21 by Community Bot
