I am trying to find a faster solution than the one in this link. The issue is that when my data grows large (e.g. 1M rows), that approach becomes considerably slow, especially with second-by-second data instead of the minute-by-minute data in the original post.

So I am trying to find a more efficient way of doing it using NumPy's arange, but I am running into an error:

#First- with pd.to_datetime
x = pd.DataFrame({ "ID": np.repeat(df.ID.values, df.time_delta.values),
                        "time": np.arange(pd.to_datetime(df.FROM.values), pd.to_datetime(df.TO.values), np.timedelta64(1,'s'))})
#Second - without pd.to_datetime    
x = pd.DataFrame({ "ID": np.repeat(df.ID.values, df.time_delta.values),
                        "time": np.arange(df.FROM.values, df.TO.values, np.timedelta64(1,'s'))})

The idea here is to repeat each ID once for every second between column FROM and column TO (that count is stored in time_delta). But I keep getting the error ValueError: Could not convert object to NumPy timedelta.

Here are the dtypes of my df:

ID                         object
FROM          datetime64[ns, UTC]
TO            datetime64[ns, UTC]
time_delta                  int64
dtype: object

Can anyone tell me what I am doing wrong?

Thank you in advance.

  • Can you provide a sample of your input dataframe: something like df.head() or df.head().to_dict()? Commented Jun 27, 2018 at 11:05

1 Answer

You can use:

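import pandas as pd

# df is the sample data from the linked question (ID, FROM, TO as 'HH:MM' strings)
# append seconds so the strings parse as HH:MM:SS, then convert to timedeltas
df['FROM'] = pd.to_timedelta(df['FROM'] + ':00')
df['TO'] = pd.to_timedelta(df['TO'] + ':00')

# for each row build a one-minute timedelta_range indexed Series, then join them
df1 = (pd.concat([pd.Series(r.ID,
                   pd.timedelta_range(r.FROM, r.TO, freq='1Min')) for r in df.itertuples()])
        .reset_index())

df1.columns = ['time','ID']
print(df1)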
       time ID
0  15:30:00  A
1  15:31:00  A
2  15:32:00  A
3  15:33:00  A
4  16:40:00  B
5  16:41:00  B
6  16:42:00  B
7  16:43:00  B
8  16:44:00  B
9  15:20:00  C
10 15:21:00  C
11 15:22:00  C

NumPy solution from this answer, adapted for timedeltas:

#data from linked question
print (df)
  ID   FROM     TO
0  A  15:30  15:33
1  B  16:40  16:44
2  C  15:20  15:22


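import numpy as np

# one minute expressed in nanoseconds (the unit of timedelta64[ns])
minute = int(60 * 1e9)

# convert both columns to timedeltas and then to numpy arrays
sd = pd.to_timedelta(df['FROM'] + ':00').values
ed = pd.to_timedelta(df['TO'] + ':00').values
dd = ed - sd
# number of repeats per row: whole minutes between FROM and TO, plus one to include TO
ds = (dd / minute).astype(int) + 1

# total number of output rows
smins = ds.sum()
# end position (exclusive) of each row's block in the output
cmins = ds.cumsum()
rng = np.arange(smins)
# start position of each block: the last cumsum wraps to 0 and is rolled to the front
slc = np.roll(cmins % smins, 1)
# offset of every output row within its own block: 0, 1, 2, ...
add = rng - rng[slc].repeat(ds)

# DataFrame constructor: repeat the ID and add the per-row minute offset to FROM
df = pd.DataFrame(dict(
       ID = df.ID.values.repeat(ds),
       time = sd.repeat(ds) + add * minute))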

print(df)
   ID     time
0   A 15:30:00
1   A 15:31:00
2   A 15:32:00
3   A 15:33:00
4   B 16:40:00
5   B 16:41:00
6   B 16:42:00
7   B 16:43:00
8   B 16:44:00
9   C 15:20:00
10  C 15:21:00
11  C 15:22:00
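
For the case in the question (timezone-aware datetime64 columns, one-second steps), note that np.arange expects a single scalar start and stop, so passing whole columns to it cannot produce one range per row. The same repeat-and-offset idea can be sketched as below. This is a minimal sketch under assumptions, not a tested drop-in: the sample frame is made up, the names counts, starts, offsets and out are illustrative, and TO is assumed to be included in each range (hence the + 1, as in the minute version above).

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the question's frame (tz-aware FROM/TO)
df = pd.DataFrame({
    "ID": ["A", "B"],
    "FROM": pd.to_datetime(["2018-06-27 15:30:00", "2018-06-27 16:40:00"], utc=True),
    "TO": pd.to_datetime(["2018-06-27 15:30:03", "2018-06-27 16:40:02"], utc=True),
})

# Rows to generate per ID: whole seconds between FROM and TO,
# plus one so that TO itself is included (an assumption)
counts = ((df["TO"] - df["FROM"]).dt.total_seconds().astype(np.int64) + 1).to_numpy()

# Start position of each ID's block in the output
starts = np.cumsum(counts) - counts
# Offset of every output row within its block: 0, 1, 2, ...
offsets = np.arange(counts.sum()) - np.repeat(starts, counts)

out = pd.DataFrame({
    "ID": np.repeat(df["ID"].to_numpy(), counts),
    # Repeat each FROM, then add the per-row offset in seconds
    "time": df["FROM"].repeat(counts).reset_index(drop=True)
            + pd.to_timedelta(offsets, unit="s"),
})
print(out)
```

Staying in pandas for the final addition keeps the UTC timezone on the time column; going through .values would drop it. If the time_delta column from the question already holds the per-row count, it could be used in place of counts.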