ValueError Pandas duplicate rows with time sequence using Numpy arange

Question

I am trying to find a faster solution than the one in this link. When my data grows large (e.g. 1M rows), that approach becomes considerably slow, especially when the step is second-by-second instead of minute-by-minute as in the original post.

So I am trying to find a more efficient way of doing it using NumPy arange, but I am running into an error:

# First: with pd.to_datetime
x = pd.DataFrame({"ID": np.repeat(df.ID.values, df.time_delta.values),
                  "time": np.arange(pd.to_datetime(df.FROM.values), pd.to_datetime(df.TO.values), np.timedelta64(1, 's'))})
# Second: without pd.to_datetime
x = pd.DataFrame({"ID": np.repeat(df.ID.values, df.time_delta.values),
                  "time": np.arange(df.FROM.values, df.TO.values, np.timedelta64(1, 's'))})

The idea here is to repeat each ID once for every second between the FROM and TO columns (the time_delta column holds that count). But both attempts raise ValueError: Could not convert object to NumPy timedelta.

Here are the dtypes of my df:

ID                         object
FROM          datetime64[ns, UTC]
TO            datetime64[ns, UTC]
time_delta                  int64
dtype: object

Can anyone tell me what I am doing wrong?

Thank you in advance.

Can you provide a sample of your input dataframe: something like df.head() or df.head().to_dict()?
– jpp, Commented Jun 27, 2018 at 11:05

jezrael · Accepted Answer · 2018-06-27 13:03:30Z

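You can use the following, shown here with the sample data from the linked question, where FROM and TO are strings such as 15:30: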

#convert columns to timedeltas
df['FROM'] = pd.to_timedelta(df['FROM'] + ':00')
df['TO'] = pd.to_timedelta(df['TO'] + ':00')

#for each row create timedelta_range and join together
df1 = (pd.concat([pd.Series(r.ID,
                   pd.timedelta_range(r.FROM,r.TO, freq='1Min')) for r in df.itertuples()])
        .reset_index())

df1.columns = ['time','ID']
print (df1)
       time ID
0  15:30:00  A
1  15:31:00  A
2  15:32:00  A
3  15:33:00  A
4  16:40:00  B
5  16:41:00  B
6  16:42:00  B
7  16:43:00  B
8  16:44:00  B
9  15:20:00  C
10 15:21:00  C
11 15:22:00  C

A NumPy solution, adapted from this answer to work with timedeltas:

#data from linked question
print (df)
  ID   FROM     TO
0  A  15:30  15:33
1  B  16:40  16:44
2  C  15:20  15:22


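#one minute in nanoseconds (the step size)
minute = int(60 * 1e9)

#convert both columns to timedeltas and then to numpy arrays
sd = pd.to_timedelta(df['FROM'] + ':00').values
ed = pd.to_timedelta(df['TO'] + ':00').values
dd = ed - sd
#number of repeats per row (both endpoints included)
ds = (dd / minute).astype(int) + 1

#total number of output rows
smins = ds.sum()
#end position of each row's block in the output
cmins = ds.cumsum()
rng = np.arange(smins)
#start position of each block: the last cumsum wraps to 0 and is rolled to the front
slc = np.roll(cmins % smins, 1)
#offset of every output row within its own block (0, 1, 2, ...)
add = rng - rng[slc].repeat(ds)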

#DataFrame constructor
df = pd.DataFrame(dict(
       ID = df.ID.values.repeat(ds),
       time = sd.repeat(ds) + add * minute))

print(df)
   ID     time
0   A 15:30:00
1   A 15:31:00
2   A 15:32:00
3   A 15:33:00
4   B 16:40:00
5   B 16:41:00
6   B 16:42:00
7   B 16:43:00
8   B 16:44:00
9   C 15:20:00
10  C 15:21:00
11  C 15:22:00
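
To apply the same vectorized idea to the frame described in the question (datetime64[ns, UTC] columns, one-second steps), something like the sketch below may work. The sample values are hypothetical, the per-block offsets are computed with cumsum and repeat rather than the np.roll trick above, and both endpoints are assumed to be included. Note that np.arange takes scalar start and stop values, so it cannot build all the ranges from whole columns in one call, which is why the offsets are built from a single integer range instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the question's frame (tz-aware UTC columns).
df = pd.DataFrame({
    "ID": ["A", "B"],
    "FROM": pd.to_datetime(["2018-06-27 15:30:00", "2018-06-27 16:40:00"], utc=True),
    "TO": pd.to_datetime(["2018-06-27 15:30:03", "2018-06-27 16:40:02"], utc=True),
})

second = np.timedelta64(1, "s")

# Plain datetime64[ns] arrays expressed in UTC (the timezone is re-attached at the end).
start = df["FROM"].to_numpy(dtype="datetime64[ns]")
stop = df["TO"].to_numpy(dtype="datetime64[ns]")

# Number of output rows per input row, counting both endpoints.
counts = ((stop - start) / second).astype(np.int64) + 1

# Offset of each output row inside its block: 0, 1, 2, ... restarting per input row.
block_start = np.repeat(counts.cumsum() - counts, counts)
offset = np.arange(counts.sum()) - block_start

times = np.repeat(start, counts) + offset * second

out = pd.DataFrame({
    "ID": np.repeat(df["ID"].to_numpy(), counts),
    "time": pd.DatetimeIndex(times).tz_localize("UTC"),
})
print(out)
```

If TO should be exclusive, as it would be with np.arange, drop the + 1 when computing counts.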
