44

How can I flag a row in a dataframe every time a column changes its string value?

Ex:

Input

ColumnA   ColumnB
1            Blue
2            Blue
3            Red
4            Red
5            Yellow


# diff() won't work here with strings; it only works on numerical values
dataframe['changed'] = dataframe['ColumnB'].diff()


Desired output

ColumnA   ColumnB      changed
1            Blue         0
2            Blue         0
3            Red          1
4            Red          0
5            Yellow       1
1
  • Performance note: it might be better to simply use the np.bool type instead of integers, since np.bool takes up a single byte. I suppose you could use np.int8, but by default np.int32 or np.int64 (whatever a C long is on your system) is used, I believe. Commented Oct 31, 2016 at 18:58
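To illustrate the comment's point, here is a minimal sketch that compares the size of the flag column when stored as bool, int8, and int64. The sample frame is rebuilt from the question; the variable names are illustrative, and the default integer width you get from astype(int) depends on your platform and NumPy/pandas versions.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# Boolean flag: True where the value differs from the previous row.
changed = df["ColumnB"].ne(df["ColumnB"].shift().bfill())

as_int8 = changed.astype(np.int8)    # 1 byte per element
as_int64 = changed.astype(np.int64)  # 8 bytes per element

# Compare the raw buffer sizes of the three representations.
for name, s in [("bool", changed), ("int8", as_int8), ("int64", as_int64)]:
    print(name, s.dtype, s.to_numpy().nbytes)
```

On a five-row frame the difference is negligible; it only starts to matter for columns with many millions of rows.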

3 Answers

36

I get better performance with ne instead of using the actual != comparison:

df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
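For readers who want to see what each step does, here is a sketch that breaks the one-liner apart on the question's sample data. The intermediate variable name previous is only for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"],
})

# shift() moves every value down one row, leaving NaN in the first row.
previous = df["ColumnB"].shift()   # NaN, Blue, Blue, Red, Red

# bfill() fills that NaN with the next value, so the first row is
# compared with itself and is never flagged as a change.
previous = previous.bfill()        # Blue, Blue, Blue, Red, Red

df["changed"] = df["ColumnB"].ne(previous).astype(int)
print(df["changed"].tolist())      # [0, 0, 1, 0, 1]
```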

Timings

Using the following setup to produce a larger dataframe:

df = pd.concat([df]*10**5, ignore_index=True) 

I get the following timings:

%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop

%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop

%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop

%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
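The %timeit lines above require IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The frame here is smaller than the one used for the timings above (an arbitrary choice to keep the run short), and results will vary with hardware and pandas version, so the figures above may not reproduce.

```python
import timeit

import pandas as pd

small = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})
# 10**4 repetitions instead of 10**5, purely to keep the run short.
df = pd.concat([small] * 10**4, ignore_index=True)


def with_ne():
    return df["ColumnB"].ne(df["ColumnB"].shift()).astype(int)


def with_operator():
    return (df["ColumnB"] != df["ColumnB"].shift()).astype(int)


# Both variants must agree before their timings are worth comparing.
assert with_ne().equals(with_operator())

for fn in (with_ne, with_operator):
    # Best of 3 runs, 10 calls per run, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```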

5 Comments

Please can you add timings for (df.ColumnB.ne(df.ColumnB.shift())).astype(int) ?
@jezrael: Added the timing. Using ix to make the first row 0 adds ~1 ms to the timing, so it looks to be fastest that way.
Hi, I am using this answer in my script, but it raises a 'SettingWithCopyWarning'. Does anyone else see that? dff['changed'] = dff.col1.ne(dff.col1.shift(1))
@root How do I get the sequence of state changes, that is, Blue -> Red, Red -> Yellow, in the order they were detected?
@root Can I directly detect the change in state from Blue to Yellow, despite having Red in the middle?
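One possible way to list the transitions asked about in the comments, sketched with the same shift-and-compare idea. This is not part of the original answer; the names prev, is_change, and transitions are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})

prev = df["ColumnB"].shift()

# A change needs a previous value, so the first row is excluded.
is_change = prev.notna() & df["ColumnB"].ne(prev)

# Pair each changed value with the value it replaced, in row order.
transitions = list(zip(prev[is_change], df["ColumnB"][is_change]))
print(transitions)  # [('Blue', 'Red'), ('Red', 'Yellow')]
```

This only reports changes between adjacent rows, so a Blue to Yellow change with Red in between appears as two separate transitions.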
9

Use .shift and compare:

dataframe['changed'] = dataframe['ColumnB'] != dataframe['ColumnB'].shift(1).fillna(dataframe['ColumnB'])
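A small sketch of this shift-and-compare approach on the question's data. Filling the shifted column's leading NaN with the column itself makes the first row compare equal to itself. Note that the direction of the comparison matters: == marks rows that are the same as the previous one, while != marks the rows that changed, which is what the question asks for.

```python
import pandas as pd

dataframe = pd.DataFrame({"ColumnB": ["Blue", "Blue", "Red", "Red", "Yellow"]})

# Previous row's value; the first row's NaN is filled from the column itself.
previous = dataframe["ColumnB"].shift(1).fillna(dataframe["ColumnB"])

same_as_previous = dataframe["ColumnB"] == previous
changed = dataframe["ColumnB"] != previous

print(same_as_previous.tolist())  # [True, True, False, True, False]
print(changed.tolist())           # [False, False, True, False, True]
```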

1 Comment

very clean answer
8

Comparing with shift works for me; the first row is then set to 0 because it has no previous value:

df['diff'] = (df.ColumnB != df.ColumnB.shift()).astype(int)
# .ix was removed in pandas 1.0; .loc does the same label-based assignment
df.loc[df.index[0], 'diff'] = 0
print(df)
   ColumnA ColumnB  diff
0        1    Blue     0
1        2    Blue     0
2        3     Red     1
3        4     Red     0
4        5  Yellow     1

Edit: according to the timings in another answer, using ne is fastest:

df['diff'] = (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
# .ix was removed in pandas 1.0; .loc does the same label-based assignment
df.loc[df.index[0], 'diff'] = 0

3 Comments

I wonder, is there a performance difference between this approach and simply using !=?
@jezrael How can I do the same thing based on two columns?
@Navroop - do you mean df[['ColumnA','ColumnB']].ne(df[['ColumnA','ColumnB']].shift()).any(axis=1).astype(int) ?
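A sketch of the two-column variant suggested in the last comment. The sample values are hypothetical (changed from the question's data so that each column varies independently), and cols is an illustrative name.

```python
import pandas as pd

# Hypothetical data, chosen so that each column changes on its own.
df = pd.DataFrame({
    "ColumnA": [1, 1, 1, 2, 2],
    "ColumnB": ["Blue", "Blue", "Red", "Red", "Red"],
})

cols = ["ColumnA", "ColumnB"]

# Flag a row when at least one of the columns differs from the previous row.
changed = df[cols].ne(df[cols].shift()).any(axis=1).astype(int)

# The first row has no predecessor, so it is not counted as a change.
changed.iloc[0] = 0

df["changed"] = changed
print(df["changed"].tolist())  # [0, 0, 1, 1, 0]
```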
