Pandas - adding columns, matching on index

Question

I'm trying to figure out if Panda's, when adding two series together, automatically matches on index or if it simply adds by element position. If its just by position, is there a way to get it to add on index? I've looked at merging, but I'm not very clear if the key in this case can be the index of both...

For example, if I have do DFs:

df1 = index  value
        0      10
        1      12
        2      15
        4      20


df2 = index  value
        0      10
        1      10
        3      10
        4      10

and I want to add df1[total] = df1[value] + df2[value] =

df1 = index  value
        0      20
        1      22
        2      15
        3      10
        4      30

Thanks for your help in advance!

Scott Boston · Accepted Answer · 2017-06-12 18:12:36Z

6

Because of the intrinsic data alignment in pandas, you can use add with fill_value=0 and it will sum these two series based on index alignment.

df1.add(df2,fill_value=0)

Input:

df1 = pd.Series([10]*4,index=[0,1,3,4])

df2 = pd.Series([10,12,15,20], index=[0,1,2,4])

df1.add(df2,fill_value=0)

Output:

0    20.0
1    22.0
2    15.0
3    10.0
4    30.0
dtype: float64

edited Jun 12, 2017 at 18:12

answered Jun 12, 2017 at 17:58

Scott Boston

154k15 gold badges160 silver badges207 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Gene Burinsky · Accepted Answer · 2024-03-25 17:40:31Z

Just do this:

pd.concat([df1,df2], axis=1).sum(axis=1)

pd.concat will merge the 2(or more) frames and match based on index. sum(axis=1) just sums across the rows.

Here's the working example:

#create the example data
df1 = pd.DataFrame({'index':[0,1,2,4],'value':[10,12,15,20]}).set_index('index')
df2 = pd.DataFrame({'index':[0,1,3,4],'value':[10,10,10,10]}).set_index('index')

The above will give you:

In [7]: pd.concat([df1,df2],axis=1).sum(axis=1)
Out[7]:
index
0    20.0
1    22.0
2    15.0
3    10.0
4    30.0
dtype: float64

Timing Update

A comment wanted to see which solution is faster. tldr, using .add() is about 2x faster than pd.concat, but if you use .align explicitly, it'll be even faster, albeit marginally

# create example dataframe
N=1000000
np.random.seed(666)
df1 = pd.DataFrame({'A':np.random.randint(0,1000, size=N),"B":np.random.randint(0,1000, size=N)}, index=np.arange(0,N,1))
df2 = pd.DataFrame({'A':np.random.randint(0,1000, size=N),"B":np.random.randint(0,1000, size=N)}, index=np.arange(0,N,1))

# punch a few holes to mismatch integers
df1 = df1.loc[np.random.choice(df1.index,int(.8*N), replace=False),:].sort_index()
df2 = df2.loc[np.random.choice(df1.index,int(.8*N), replace=False),:].sort_index()

# define functions
def summatch1(x,y):
    return pd.concat([x,y],axis=1).sum(axis=1)

def summatch2(x,y):
    return x.add(y,fill_value=0)

def summatch3(x,y):
    z = x.merge(y,how='outer',suffixes=['1','2'],left_index=True, right_index=True).fillna(0)
    return z.A1+z.A2

def summatch4(x,y):
    p1,p2 = x.align(y,join='outer', fill_value=0)
    return p1+p2

# time them
%timeit summatch1(df1.A, df2.A)
%timeit summatch2(df1.A, df2.A)
%timeit summatch3(df1.loc[:,['A']], df2.loc[:,['A']])
%timeit summatch4(df1.A, df2.A)

Results

Times are reported in the following order:

pd.concat
df1.add
pd.merge
df1.align

800k rows in df1 and df2

79.7 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
36.6 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
40.2 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
31.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

80k rows in df1 and df2

7.2 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.08 ms ± 90.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.14 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.76 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Will this be faster or slow than @scott Boston answer of adding two series with pad value=0?
@limKaizhuo not exactly timely (pun not intended) but I timed the various options. .add is faster but using .align is even faster, albeit marginally.

Collectives™ on Stack Overflow

Pandas - adding columns, matching on index

2 Answers 2

Comments

Just do this:

Here's the working example:

Timing Update

Results

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Just do this:

Here's the working example:

Timing Update

Results

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related