Modify pandas dataframe in python based on multiple rows

Question

I am working with a DataFrame in Pandas / Python, each row has an ID (that is not unique), I would like to modify the dataframe to add a column with the secondname for each row that has multiple matching ID's.

Starting with:

   ID Name  Rate
0   1    A  65.5
1   2    B  67.3
2   2    C  78.8
3   3    D  65.0
4   4    E  45.3
5   5    F  52.0
6   5    G  66.0
7   6    H  34.0
8   7    I   2.0

Trying to get to:

   ID Name  Rate Secondname
0   1    A  65.5       None
1   2    B  67.3       C
2   2    C  78.8       B
3   3    D  65.0       None
4   4    E  45.3       None
5   5    F  52.0       G
6   5    G  66.0       F
7   6    H  34.0       None
8   7    I   2.0       None

My code:

import numpy as np
import pandas as pd


mydict = {'ID':[1,2,2,3,4,5,5,6,7],
             'Name':['A','B','C','D','E','F','G','H','I'],
             'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}

df=pd.DataFrame(mydict)

df['Newname']='None'

for i in range(0, df.shape[0]-1):
    if df.irow(i)['ID']==df.irow(i+1)['ID']:       
        df.irow(i)['Newname']=df.irow(i+1)['Name']

Which results in the following error:

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.irow(i)['Newname']=df.irow(i+1)['Secondname']
C:\Users\L\Anaconda3\lib\site-packages\pandas\core\series.py:664:     SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas- docs/stable/indexing.html#indexing-view-versus-copy
self.loc[key] = value

Any help would be much appreciated.

Community · Accepted Answer · 2017-05-23 11:50:33Z

You can use groupby with custom function f, which use shift and combine_first:

def f(x):
    #print x
    x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
    return x

print df.groupby('ID').apply(f)
   ID Name  Rate Secondname
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN

You can avoid groupby and find duplicated, then fill helper columns by loc with column Name, then shift and combine_first and last drop helper columns:

print df.duplicated('ID', keep='first')
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
dtype: bool   
print df.duplicated('ID', keep='last')
0    False
1     True
2    False
3    False
4    False
5     True
6    False
7    False
8    False
dtype: bool  
df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
print df
   ID Name  Rate   first   last
0   1    A  65.5  NaN  NaN
1   2    B  67.3  NaN    B
2   2    C  78.8    C  NaN
3   3    D  65.0  NaN  NaN
4   4    E  45.3  NaN  NaN
5   5    F  52.0  NaN    F
6   5    G  66.0    G  NaN
7   6    H  34.0  NaN  NaN
8   7    I   2.0  NaN  NaN

df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
df = df.drop(['first', 'l1'], axis=1)

print df
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN

TESTING: (in time of testing solution of Roman Kh has wrong output)

len(df) = 9:

In [154]: %timeit jez(df1)
100 loops, best of 3: 15 ms per loop

In [155]: %timeit jez2(df2)
100 loops, best of 3: 3.45 ms per loop

In [156]: %timeit rom(df)
100 loops, best of 3: 3.55 ms per loop

len(df) = 90k:

In [158]: %timeit jez(df1)
10 loops, best of 3: 57.1 ms per loop

In [159]: %timeit jez2(df2)
10 loops, best of 3: 36.4 ms per loop

In [160]: %timeit rom(df)
10 loops, best of 3: 40.4 ms per loop

import pandas as pd

mydict = {'ID':[1,2,2,3,4,5,5,6,7],
             'Name':['A','B','C','D','E','F','G','H','I'],
             'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}

df=pd.DataFrame(mydict)
print df


df =  pd.concat([df]*10000).reset_index(drop=True)

df1 = df.copy()
df2 = df.copy()

def jez(df):
    def f(x):
        #print x
        x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
        return x

    return df.groupby('ID').apply(f)


def jez2(df): 
    #print df.duplicated('ID', keep='first')
    #print df.duplicated('ID', keep='last')
    df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
    df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
    #print df

    df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
    df = df.drop(['first', 'last'], axis=1)
    return df



def rom(df):

    # cpIDs = True if the next row has the same ID
    df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
    # fill in the last row (get rid of NaN)
    df.iloc[-1,df.columns.get_loc('cpIDs')] = False
    # ShiftName == Name of the next row
    df['ShiftName'] = df['Name'].shift(-1)
    # fill in SecondName
    df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
    # remove columns
    del df['cpIDs']
    del df['ShiftName']
    return df


print jez(df1)  
print jez2(df2)
print rom(df)

print jez(df1)  
   ID Name  Rate Secondname
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
print jez2(df2)
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
print rom(df) 
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8        NaN
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0        NaN
7   6    H  34.0        NaN
8   7    I   2.0        NaN

EDIT:

If there is more duplicated pairs with same names, use shift for creating first and last columns:

df.loc[ df['ID'] == df['ID'].shift(), 'first'] = df['Name']
df.loc[ df['ID'] == df['ID'].shift(-1), 'last'] = df['Name']

Great answer! However, what if there are several duplicated rows? This is a question to OP, though.
Really helpful - thank you. I implemented your group by which worked well. Are there advantages to the duplicated method, or is it simply an alternative? Thank you.
Answer is easy - not groupby is faster. I small dataframes 5 times and in large dataframes 0.6 times as you can see in my testing %timeit jez(df1) vs %timeit jez2(df2)

Roman Kh · Accepted Answer · 2016-02-13 19:12:14Z

0

If your dataframe is sorted by ID, you might add a new column which compares ID of the current row with ID of the next row:

# cpIDs = True if the next row has the same ID
df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
# fill in the last row (get rid of NaN)
df.iloc[-1,df.columns.get_loc('cpIDs')] = False
# ShiftName == Name of the next row
df['ShiftName'] = df['Name'].shift(-1)
# fill in SecondName
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
# remove columns
del df['cpIDs']
del df['ShiftName']

Of course, you can shorten the code above, as I intentionally made it longer, but easier to comprehend. Depending on your dataframe size it might be pretty fast (perhaps the fastest) as it does not use any complicated operations.

P.S. As a side note, try to avoid any loops when dealing with dataframes and numpy arrays. Almost always you can find a so called vector solution which operates on the whole array or large ranges, not on individual cells and rows.

answered Feb 13, 2016 at 19:12

Roman Kh

2,7652 gold badges20 silver badges16 bronze badges

2 Comments

jezrael Over a year ago

Please check your answer, because it has wrong output.

Roman Kh Over a year ago

Yes, you're right, I misunderstood the task a little bit. Nevertheless, I'll leave my answer unchanged as your second solution is just great.

Collectives™ on Stack Overflow

Modify pandas dataframe in python based on multiple rows

2 Answers 2

4 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related