1

I have a dataframe:

    site1   time1   site2   time2   site3   time3   site4   time4   site5   time5   ... time6   site7   time7   site8   time8   site9   time9   site10  time10  target
session_id                                                                                  

21669   56  2013-01-12 08:05:57 55.0    2013-01-12 08:05:57 NaN NaT NaN NaT NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
54843   56  2013-01-12 08:37:23 55.0    2013-01-12 08:37:23 56.0    2013-01-12 09:07:07 55.0    2013-01-12 09:07:09 NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
77292   946 2013-01-12 08:50:13 946.0   2013-01-12 08:50:14 951.0   2013-01-12 08:50:15 946.0   2013-01-12 08:50:15 946.0   2013-01-12 08:50:16 ... 2013-01-12 08:50:16 948.0   2013-01-12 08:50:16 784.0   2013-01-12 08:50:16 949.0   2013-01-12 08:50:17 946.0   2013-01-12 08:50:17 0

I need to compute the difference between the last non-NaN time and the first time in each row.

Desired output (converted to seconds):

session_id    diff
 21669         0
 54843        2013-01-12 09:07:09 - 2013-01-12 08:37:23
 77292        4

I can do it for every pair of columns and then merge the results:

df['diff1'] = df['time1'] - df['time2']
...

But is there a faster way to do it?

You could make it easier for people to answer your question if you provided a convenient way to produce the sample data. Commented Oct 30, 2017 at 6:46
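
For anyone who wants to try the answers, here is a minimal sketch of one way to rebuild the sample. It only recreates the `time1` to `time10` columns for the three sessions printed in the question; the `site` columns and `target` are left out because part of them is elided in the printed output. The names `rows` and `time_cols` are illustrative, not from the original post.

```python
import pandas as pd

# Timestamps copied from the three sessions printed in the question;
# None becomes NaT after conversion.
rows = {
    21669: ["2013-01-12 08:05:57", "2013-01-12 08:05:57"] + [None] * 8,
    54843: ["2013-01-12 08:37:23", "2013-01-12 08:37:23",
            "2013-01-12 09:07:07", "2013-01-12 09:07:09"] + [None] * 6,
    77292: ["2013-01-12 08:50:13", "2013-01-12 08:50:14",
            "2013-01-12 08:50:15", "2013-01-12 08:50:15",
            "2013-01-12 08:50:16", "2013-01-12 08:50:16",
            "2013-01-12 08:50:16", "2013-01-12 08:50:16",
            "2013-01-12 08:50:17", "2013-01-12 08:50:17"],
}
time_cols = [f"time{i}" for i in range(1, 11)]

df = pd.DataFrame.from_dict(rows, orient="index", columns=time_cols)
df.index.name = "session_id"
df = df.apply(pd.to_datetime)  # convert every column to a datetime dtype
```

Because `target` is not included here, answers that start with dropping `target` would need that step skipped on this reduced frame.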

3 Answers

2
  • I dropped target.
  • I split the column names into a pd.MultiIndex of (name, number) pairs.
  • I made sure the timestamps were actually timestamps (you can skip this step if yours already are).
  • I grouped by 'session_id' and used 'first' and 'last' to get the first and last non-null values.
  • I used pipe to conveniently pass the result to a function that does the subtraction for me.

d = df.drop(columns='target')
# Split names such as 'time3' into ('time', '3')
a = d.columns.str.extract(r'([a-z]+)(\d+)', expand=True).values.T
mux = pd.MultiIndex.from_arrays([a[0], a[1].astype(int)])
d.columns = mux

for (c0, c1), col in d.items():  # iteritems() was removed in pandas 2.0
    if c0 == 'time':
        d[(c0, c1)] = pd.to_datetime(col, errors='coerce')

f = lambda d: d['last'].sub(d['first']).dt.total_seconds()
d.time.stack().groupby('session_id').agg(['last', 'first']).pipe(f)

session_id
21669       0.0
54843    1786.0
77292       4.0
dtype: float64
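
To see why 'first' and 'last' are enough, a small self-contained sketch may help. It assumes a frame that already holds only time columns; the `times` frame below is an illustrative stand-in built from two sessions in the question, and it groups on the index level instead of using the MultiIndex columns.

```python
import pandas as pd

times = pd.DataFrame(
    {
        "time1": ["2013-01-12 08:05:57", "2013-01-12 08:37:23"],
        "time2": ["2013-01-12 08:05:57", "2013-01-12 08:37:23"],
        "time3": [None, "2013-01-12 09:07:07"],
        "time4": [None, "2013-01-12 09:07:09"],
    },
    index=pd.Index([21669, 54843], name="session_id"),
).apply(pd.to_datetime)

# Long format: one entry per (session_id, column) pair.
long = times.stack()

# GroupBy 'first'/'last' skip missing values, so trailing NaT are ignored.
bounds = long.groupby(level="session_id").agg(["first", "last"])
diff = (bounds["last"] - bounds["first"]).dt.total_seconds()
```

Since the grouped values keep their column order, 'first' and 'last' here mean first and last by column position, not earliest and latest by time.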

2

Use:


a = df.filter(like='time').notnull().iloc[:, ::-1].idxmax(1)
print (a)
0    time2
1    time4
2    time5
dtype: object
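
The one-liner above packs three steps together. The sketch below spells them out on a tiny frame; `t` is an illustrative stand-in for `df.filter(like='time')`, filled with two sessions from the question.

```python
import pandas as pd

t = pd.DataFrame(
    {
        "time1": ["2013-01-12 08:05:57", "2013-01-12 08:37:23"],
        "time2": ["2013-01-12 08:05:57", "2013-01-12 08:37:23"],
        "time3": [None, "2013-01-12 09:07:07"],
        "time4": [None, "2013-01-12 09:07:09"],
    }
).apply(pd.to_datetime)

mask = t.notnull()                  # True where a timestamp is present
reversed_mask = mask.iloc[:, ::-1]  # columns now run time4 ... time1
# idxmax returns the label of the first True in each row; on the reversed
# frame that is the last non-null column in the original order.
last_col = reversed_mask.idxmax(axis=1)
```

One caveat: a row with no timestamps at all has no True value, so `idxmax` would still return a label for it (the last column), and such rows would need separate handling.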

# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0
df['diff'] = (pd.Series(df.lookup(df.index, a), index=df.index)
                .sub(df['time1'])
                .dt.total_seconds())
print (df['diff'])
0       0.0
1    1786.0
2       4.0
Name: diff, dtype: float64

NumPy alternative (requires import numpy as np):

A = df.filter(like='time')
b =  len(A.columns) - A.notnull().values[:, ::-1].argmax(1) - 1

# Pass index=A.index so the subtraction aligns even when the index is not 0..n-1
df['diff'] = pd.Series(A.values[np.arange(len(A)), b], index=A.index).sub(df['time1']).dt.total_seconds()
print (df['diff'])
0       0.0
1    1786.0
2       4.0
Name: diff, dtype: float64

A more general version of Ken Wei's solution, selecting the first and last columns with iloc:

df1 = df.filter(like='time')
df['diff'] = df1.ffill(axis=1).iloc[:, -1].sub(df1.iloc[:, 0]).dt.total_seconds()
print (df['diff'])
0       0.0
1    1786.0
2       4.0
Name: diff, dtype: float64
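
The reason this works is that a forward fill along the columns copies each row's last real timestamp into every column to its right, so the final column always holds the last non-null value. A minimal sketch, again on an illustrative two-session frame `t` taken from the question's data:

```python
import pandas as pd

t = pd.DataFrame(
    {
        "time1": ["2013-01-12 08:05:57", "2013-01-12 08:37:23"],
        "time2": ["2013-01-12 08:05:57", "2013-01-12 08:37:23"],
        "time3": [None, "2013-01-12 09:07:07"],
        "time4": [None, "2013-01-12 09:07:09"],
    }
).apply(pd.to_datetime)

# Each NaT takes the nearest non-null value to its left in the same row.
filled = t.ffill(axis=1)

# Last column after filling minus the first original column, in seconds.
diff = (filled.iloc[:, -1] - t.iloc[:, 0]).dt.total_seconds()
```

Selecting by position with `iloc` avoids hard-coding names such as `time10`, so the same lines work for any number of time columns.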


1

Use .ffill() on the dataframe with just the time columns:

df['diff1'] = df.filter(like='time').ffill(axis = 1).time10 - df.time1
