adding values in new column based on indexes with pandas in python

Question

I'm just getting into pandas and I am trying to add a new column to an existing dataframe.

I have two dataframes where the index of one data frame links to a column in another dataframe. Where these values are equal I need to put the value of another column in the source dataframe in a new column of the destination column.

The code section below illustrates what I mean. The commented part is what I need as an output.

I guess I need the .loc[] function.

Another, minor, question: is it bad practice to have a non-unique indexes?

import pandas as pd

d = {'key':['a',  'b', 'c'], 
     'bar':[1, 2, 3]}

d2 = {'key':['a', 'a', 'b'],
      'other_data':['10', '20', '30']}

df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')

print df2

##    other_data  new_col
##key           
##a            10   1
##a            20   1
##b            30   2

jezrael · Accepted Answer · 2017-08-11 13:22:46Z

9

Use rename index by Series:

df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

Or map:

df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

If want better performance, the best is avoid duplicates in index. Also some function like reindex failed in duplicates index.

edited Aug 11, 2017 at 13:22

answered Aug 11, 2017 at 13:16

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

Comments

piRSquared · Accepted Answer · 2017-08-11 13:19:40Z

3

You can use join

df2.join(df.set_index('key'))

    other_data  bar
key                
a           10    1
a           20    1
b           30    2

One way to rename the column in the process

df2.join(df.set_index('key').bar.rename('new'))

    other_data  new
key                
a           10    1
a           20    1
b           30    2

answered Aug 11, 2017 at 13:19

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Comments

Brad Solomon · Accepted Answer · 2017-08-11 13:31:15Z

Another, minor, question: is it bad practice to have a non-unique indexes?

It is not great practice, but depends on your needs and can be okay in some circumstances.

Issue 1: join operations

A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DMBS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product--probably not what you're looking for.

For example:

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583

Issue 2: performance

Unique-valued indices make certain operations efficient, as explained in this post.

When index is unique, pandas use a hashtable to map key to value O(1). When index is non-unique and sorted, pandas use binary search O(logN), when index is random ordered pandas need to check all the keys in the index O(N).

A word on `.loc`

Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859

Bharath M Shetty · Accepted Answer · 2017-08-11 13:23:28Z

2

With the help of .loc

df2['new'] = df.set_index('key').loc[df2.index]

Output :

   other_data  new
key                
a           10    1
a           20    1
b           30    2

answered Aug 11, 2017 at 13:23

Bharath M Shetty

30.6k6 gold badges65 silver badges111 bronze badges

Comments

Zero · Accepted Answer · 2017-08-11 13:14:20Z

1

Using combine_first

In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30

Or, using map

In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
    other_data  bar
key
a           10    1
a           20    1
b           30    2

answered Aug 11, 2017 at 13:14

Zero

77.4k22 gold badges153 silver badges153 bronze badges

Collectives™ on Stack Overflow

adding values in new column based on indexes with pandas in python

5 Answers 5

Comments

Comments

Issue 1: join operations

Issue 2: performance

A word on `.loc`

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

Comments

Issue 1: join operations

Issue 2: performance

A word on .loc

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related

A word on `.loc`