Pandas: diff between columns

Question

I have a dataframe:

    site1   time1   site2   time2   site3   time3   site4   time4   site5   time5   ... time6   site7   time7   site8   time8   site9   time9   site10  time10  target
session_id                                                                                  

21669   56  2013-01-12 08:05:57 55.0    2013-01-12 08:05:57 NaN NaT NaN NaT NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
54843   56  2013-01-12 08:37:23 55.0    2013-01-12 08:37:23 56.0    2013-01-12 09:07:07 55.0    2013-01-12 09:07:09 NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
77292   946 2013-01-12 08:50:13 946.0   2013-01-12 08:50:14 951.0   2013-01-12 08:50:15 946.0   2013-01-12 08:50:15 946.0   2013-01-12 08:50:16 ... 2013-01-12 08:50:16 948.0   2013-01-12 08:50:16 784.0   2013-01-12 08:50:16 949.0   2013-01-12 08:50:17 946.0   2013-01-12 08:50:17 0

For each session, I need the difference between the last non-NaN time and the first time.

Desired output (converted to seconds):

session_id    diff
 21669         0
 54843        1786   (2013-01-12 09:07:09 - 2013-01-12 08:37:23)
 77292        4

I can do it for every pair of columns and then merge the results:

df['diff1'] = df['time1'] - df['time2']
...

But is there a faster way to do it?

You could make it easier for people to answer your question if you provided a convenient way to produce the sample data.
– piRSquared, Commented Oct 30, 2017 at 6:46

piRSquared · Accepted Answer · 2017-10-30 07:16:21Z

2

I dropped target.
I split your columns into a pd.MultiIndex.
I made sure the timestamps were actually timestamps (you can skip this step if your time columns are already datetimes).
I grouped by 'session_id' and used 'first' and 'last' to get the first and last non-null values.
I used pipe to conveniently pass the result to a function that does the subtraction for me.

import pandas as pd

d = df.drop(columns='target')
# split names such as 'time10' into ('time', 10)
a = d.columns.str.extract(r'([a-z]+)(\d+)', expand=True).values.T
mux = pd.MultiIndex.from_arrays([a[0], a[1].astype(int)])
d.columns = mux

# items() replaces iteritems(), which was removed in pandas 2.0
for (c0, c1), col in d.items():
    if c0 == 'time':
        d[(c0, c1)] = pd.to_datetime(col, errors='coerce')

f = lambda d: d['last'].sub(d['first']).dt.total_seconds()
d.time.stack().groupby('session_id').agg(['last', 'first']).pipe(f)

session_id
21669       0.0
54843    1786.0
77292       4.0
dtype: float64
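
For anyone who wants to try the stack/groupby idea without the original file, here is a minimal self-contained sketch. It rebuilds only the ten time columns from the rows printed in the question; the site and target columns are left out, and the helper names (rows, cols, agg, diff_seconds) are made up for illustration.

```python
import pandas as pd

# Time-of-day values copied from the three sessions shown in the question;
# shorter rows are padded with None so they become NaT.
rows = {
    21669: ["08:05:57", "08:05:57"],
    54843: ["08:37:23", "08:37:23", "09:07:07", "09:07:09"],
    77292: ["08:50:13", "08:50:14", "08:50:15", "08:50:15", "08:50:16",
            "08:50:16", "08:50:16", "08:50:16", "08:50:17", "08:50:17"],
}
cols = [f"time{i}" for i in range(1, 11)]
df = pd.DataFrame(
    [[f"2013-01-12 {t}" for t in ts] + [None] * (10 - len(ts)) for ts in rows.values()],
    index=pd.Index(list(rows), name="session_id"),
    columns=cols,
).apply(pd.to_datetime)

# Long format: one timestamp per (session_id, column) pair.
# groupby 'first'/'last' return the first/last non-null value per session.
agg = df.stack().groupby(level="session_id").agg(["first", "last"])
diff_seconds = (agg["last"] - agg["first"]).dt.total_seconds()
print(diff_seconds)
```

Because 'first' and 'last' skip nulls themselves, the result should not depend on whether stack drops the NaT entries.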



jezrael · Accepted Answer · 2017-10-30 07:31:21Z

Use:

filter the time columns
get the name of the last non-null column in each row with idxmax
get the values in those columns with lookup and wrap them in a Series
subtract time1 and convert the result with total_seconds

a = df.filter(like='time').notnull().iloc[:, ::-1].idxmax(1)
print (a)
0    time2
1    time4
2    time5
dtype: object

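# parentheses let the chained calls span several lines
# note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0
df['diff'] = (pd.Series(df.lookup(df.index, a), index=df.index)
                .sub(df['time1'])
                .dt.total_seconds())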
print (df['diff'])
0       0.0
1    1786.0
2       4.0
Name: diff, dtype: float64
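
Since DataFrame.lookup is no longer available in pandas 2.0 and later, here is a hedged sketch of the same idxmax idea without it. It assumes the ten time columns from the question (rebuilt by hand below, without the site and target columns) and swaps lookup for Index.get_indexer plus NumPy integer indexing; the names rows, times, last_col, col_pos and last_time are illustrative only.

```python
import numpy as np
import pandas as pd

# Time-of-day values copied from the question; missing entries become NaT.
rows = {
    21669: ["08:05:57", "08:05:57"],
    54843: ["08:37:23", "08:37:23", "09:07:07", "09:07:09"],
    77292: ["08:50:13", "08:50:14", "08:50:15", "08:50:15", "08:50:16",
            "08:50:16", "08:50:16", "08:50:16", "08:50:17", "08:50:17"],
}
df = pd.DataFrame(
    [[f"2013-01-12 {t}" for t in ts] + [None] * (10 - len(ts)) for ts in rows.values()],
    index=pd.Index(list(rows), name="session_id"),
    columns=[f"time{i}" for i in range(1, 11)],
).apply(pd.to_datetime)

times = df.filter(like="time")
# Reverse the column order so idxmax returns the label of the LAST non-null column.
last_col = times.notna().iloc[:, ::-1].idxmax(axis=1)
# Translate the labels to integer positions and pick one value per row.
col_pos = times.columns.get_indexer(last_col)
last_time = pd.Series(
    times.to_numpy()[np.arange(len(times)), col_pos], index=times.index
)
df["diff"] = (last_time - times["time1"]).dt.total_seconds()
print(df["diff"])
```

With all ten time columns present, the last non-null column for session 77292 is time10 rather than the time5 shown in the truncated sample above.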

numpy alternative:

import numpy as np

A = df.filter(like='time')
b = len(A.columns) - A.notnull().values[:, ::-1].argmax(1) - 1

# keep df's index so the subtraction aligns row by row
df['diff'] = pd.Series(A.values[np.arange(len(A)), b], index=df.index).sub(df['time1']).dt.total_seconds()
print (df['diff'])
0       0.0
1    1786.0
2       4.0
Name: diff, dtype: float64

A more general version of Ken Wei's solution selects the first and last columns with iloc:

df1 = df.filter(like='time')
df['diff'] = df1.ffill(axis=1).iloc[:, -1].sub(df1.iloc[:, 0]).dt.total_seconds()
print (df['diff'])
0       0.0
1    1786.0
2       4.0
Name: diff, dtype: float64

Ken Wei · Accepted Answer · 2017-10-30 06:46:15Z

1

Use .ffill() on the dataframe with just the time columns:

df['diff1'] = df.filter(like='time').ffill(axis = 1).time10 - df.time1
