3

I am working with a DataFrame in Pandas / Python, each row has an ID (that is not unique), I would like to modify the dataframe to add a column with the secondname for each row that has multiple matching ID's.

Starting with:

   ID Name  Rate
0   1    A  65.5
1   2    B  67.3
2   2    C  78.8
3   3    D  65.0
4   4    E  45.3
5   5    F  52.0
6   5    G  66.0
7   6    H  34.0
8   7    I   2.0

Trying to get to:

   ID Name  Rate Secondname
0   1    A  65.5       None
1   2    B  67.3       C
2   2    C  78.8       B
3   3    D  65.0       None
4   4    E  45.3       None
5   5    F  52.0       G
6   5    G  66.0       F
7   6    H  34.0       None
8   7    I   2.0       None

My code:

import numpy as np
import pandas as pd


mydict = {'ID':[1,2,2,3,4,5,5,6,7],
             'Name':['A','B','C','D','E','F','G','H','I'],
             'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}

df=pd.DataFrame(mydict)

df['Newname']='None'

for i in range(0, df.shape[0]-1):
    if df.irow(i)['ID']==df.irow(i+1)['ID']:       
        df.irow(i)['Newname']=df.irow(i+1)['Name']

Which results in the following error:

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.irow(i)['Newname']=df.irow(i+1)['Secondname']
C:\Users\L\Anaconda3\lib\site-packages\pandas\core\series.py:664:     SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas- docs/stable/indexing.html#indexing-view-versus-copy
self.loc[key] = value

Any help would be much appreciated.

2 Answers 2

4

You can use groupby with custom function f, which use shift and combine_first:

def f(x):
    #print x
    x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
    return x

print df.groupby('ID').apply(f)
   ID Name  Rate Secondname
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN

You can avoid groupby and find duplicated, then fill helper columns by loc with column Name, then shift and combine_first and last drop helper columns:

print df.duplicated('ID', keep='first')
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
dtype: bool   
print df.duplicated('ID', keep='last')
0    False
1     True
2    False
3    False
4    False
5     True
6    False
7    False
8    False
dtype: bool  
df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
print df
   ID Name  Rate   first   last
0   1    A  65.5  NaN  NaN
1   2    B  67.3  NaN    B
2   2    C  78.8    C  NaN
3   3    D  65.0  NaN  NaN
4   4    E  45.3  NaN  NaN
5   5    F  52.0  NaN    F
6   5    G  66.0    G  NaN
7   6    H  34.0  NaN  NaN
8   7    I   2.0  NaN  NaN
df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
df = df.drop(['first', 'l1'], axis=1)
print df
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN

TESTING: (in time of testing solution of Roman Kh has wrong output)

len(df) = 9:

In [154]: %timeit jez(df1)
100 loops, best of 3: 15 ms per loop

In [155]: %timeit jez2(df2)
100 loops, best of 3: 3.45 ms per loop

In [156]: %timeit rom(df)
100 loops, best of 3: 3.55 ms per loop    

len(df) = 90k:

In [158]: %timeit jez(df1)
10 loops, best of 3: 57.1 ms per loop

In [159]: %timeit jez2(df2)
10 loops, best of 3: 36.4 ms per loop

In [160]: %timeit rom(df)
10 loops, best of 3: 40.4 ms per loop
import pandas as pd

mydict = {'ID':[1,2,2,3,4,5,5,6,7],
             'Name':['A','B','C','D','E','F','G','H','I'],
             'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}

df=pd.DataFrame(mydict)
print df


df =  pd.concat([df]*10000).reset_index(drop=True)

df1 = df.copy()
df2 = df.copy()

def jez(df):
    def f(x):
        #print x
        x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
        return x

    return df.groupby('ID').apply(f)


def jez2(df): 
    #print df.duplicated('ID', keep='first')
    #print df.duplicated('ID', keep='last')
    df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
    df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
    #print df

    df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
    df = df.drop(['first', 'last'], axis=1)
    return df



def rom(df):

    # cpIDs = True if the next row has the same ID
    df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
    # fill in the last row (get rid of NaN)
    df.iloc[-1,df.columns.get_loc('cpIDs')] = False
    # ShiftName == Name of the next row
    df['ShiftName'] = df['Name'].shift(-1)
    # fill in SecondName
    df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
    # remove columns
    del df['cpIDs']
    del df['ShiftName']
    return df


print jez(df1)  
print jez2(df2)
print rom(df) 
print jez(df1)  
   ID Name  Rate Secondname
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
print jez2(df2)
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
print rom(df) 
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8        NaN
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0        NaN
7   6    H  34.0        NaN
8   7    I   2.0        NaN

EDIT:

If there is more duplicated pairs with same names, use shift for creating first and last columns:

df.loc[ df['ID'] == df['ID'].shift(), 'first'] = df['Name']
df.loc[ df['ID'] == df['ID'].shift(-1), 'last'] = df['Name']
Sign up to request clarification or add additional context in comments.

4 Comments

Great answer! However, what if there are several duplicated rows? This is a question to OP, though.
Really helpful - thank you. I implemented your group by which worked well. Are there advantages to the duplicated method, or is it simply an alternative? Thank you.
Answer is easy - not groupby is faster. I small dataframes 5 times and in large dataframes 0.6 times as you can see in my testing %timeit jez(df1) vs %timeit jez2(df2)
Makes sense. Much appreciated!
0

If your dataframe is sorted by ID, you might add a new column which compares ID of the current row with ID of the next row:

# cpIDs = True if the next row has the same ID
df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
# fill in the last row (get rid of NaN)
df.iloc[-1,df.columns.get_loc('cpIDs')] = False
# ShiftName == Name of the next row
df['ShiftName'] = df['Name'].shift(-1)
# fill in SecondName
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
# remove columns
del df['cpIDs']
del df['ShiftName']

Of course, you can shorten the code above, as I intentionally made it longer, but easier to comprehend. Depending on your dataframe size it might be pretty fast (perhaps the fastest) as it does not use any complicated operations.

P.S. As a side note, try to avoid any loops when dealing with dataframes and numpy arrays. Almost always you can find a so called vector solution which operates on the whole array or large ranges, not on individual cells and rows.

2 Comments

Please check your answer, because it has wrong output.
Yes, you're right, I misunderstood the task a little bit. Nevertheless, I'll leave my answer unchanged as your second solution is just great.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.