I have a DataFrame df with 100,000 rows and a DatetimeIndex. Take January as an example. I would like to create a new column named 'Experiment' that identifies when each experiment starts and ends; there are 10 experiments in total.

 df=
                            Place      
        Time               
        2021-01-01 00:00    home         
        2021-01-01 00:01    home       
        2021-01-01 00:02    home        
        2021-01-01 00:03    home     
        ................    ....  
        ................    ....
        2021-01-31 23:57    home
        2021-01-31 23:58    home
        2021-01-31 23:59    home

For example, experiment A runs from 2021-01-01 00:00 to 2021-01-01 00:02 and experiment J runs from 2021-01-31 23:57 to 2021-01-31 23:59. The expected result looks like this.

df=
                            Place  Experiment
        Time               
        2021-01-01 00:00    home      A   
        2021-01-01 00:01    home      A 
        2021-01-01 00:02    home      A  
        2021-01-01 00:03    home     
        ................    ....  
        ................    ....
        2021-01-31 23:57    home      J
        2021-01-31 23:58    home      J
        2021-01-31 23:59    home      J

My approach is like this.

df["experiment"] = ""
df["experiment"] = np.where(df.between_time('2021-01-01 00:00','2021-01-01 00:02'),'A',np.nan)
df["experiment"] = np.where(df.between_time('2021-01-31 23:57','2021-01-31 23:59'),'J',np.nan)

I just realised that between_time does not work when the bounds include a date. I am also getting the error that the length of values does not match the length of the index.

Thank you!

1 Answer

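Using np.where the way you do now overwrites what you already created: each call rebuilds the whole column, so the second assignment discards the labels from the first. Also, between_time only selects rows by time of day and returns the matching rows rather than a boolean mask, which is likely why the lengths do not match.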
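To illustrate the overwriting, here is a minimal sketch on a hypothetical six-minute frame. The masks, the second label 'B', and the column names `overwritten` and `kept` are made up for the demonstration and are not from the question. It also shows that passing the existing column as the fallback value keeps the earlier labels, if you want to stay with np.where.

```python
import numpy as np
import pandas as pd

# Hypothetical six-minute frame, just for illustration
idx = pd.date_range("2021-01-01 00:00", periods=6, freq="min", name="Time")
df = pd.DataFrame({"Place": "home"}, index=idx)

# Boolean masks with one entry per row (both endpoints included)
mask_a = (idx >= pd.Timestamp("2021-01-01 00:00")) & (idx <= pd.Timestamp("2021-01-01 00:01"))
mask_b = (idx >= pd.Timestamp("2021-01-01 00:04")) & (idx <= pd.Timestamp("2021-01-01 00:05"))

# Each call rebuilds the whole column, so the second one discards the 'A' labels
df["overwritten"] = np.where(mask_a, "A", None)
df["overwritten"] = np.where(mask_b, "B", None)

# Passing the existing column as the fallback keeps the earlier labels
df["kept"] = np.where(mask_a, "A", None)
df["kept"] = np.where(mask_b, "B", df["kept"])

print(df)
```

Note that the masks here are built by comparing the index directly, so they always have one entry per row, unlike the result of between_time.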

For multiple conditions, use .loc to update:

# the experiment time
list_starts = ['2021-01-01 00:00','2021-01-31 23:57']
list_ends = ['2021-01-01 00:02', '2021-01-31 23:59']
list_names = ['A','J']

for start_time, end_time, name in zip(list_starts, list_ends, list_names):
    df.loc[start_time:end_time, 'experiment'] = name

Another (better) way to organize the experiment times is a dict keyed by experiment name:

# name: (start, end)
exp_times = {
    'A': ('2021-01-01 00:00', '2021-01-01 00:02'),
    'J': ('2021-01-31 23:57', '2021-01-31 23:59')
}

for name, (start_time, end_time) in exp_times.items():
    df.loc[start_time:end_time, 'experiment'] = name

Output:

                    Place experiment
Time                                
2021-01-01 00:00:00  home          A
2021-01-01 00:01:00  home          A
2021-01-01 00:02:00  home          A
2021-01-01 00:03:00  home        NaN
2021-01-31 23:57:00  home          J
2021-01-31 23:58:00  home          J
2021-01-31 23:59:00  home          J

Note: As you may have noticed, you can use strings to slice/index a time-indexed dataframe. With .loc, such a slice includes both the start and the end label.
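To reproduce this end to end, the following is a minimal sketch. It assumes a one-minute index covering January 2021 built with pd.date_range, since the question does not show how df was created. Initialising the column with None first is my own addition, so that unlabelled rows start out explicitly missing.

```python
import pandas as pd

# Assumed setup: one row per minute for January 2021
idx = pd.date_range("2021-01-01 00:00", "2021-01-31 23:59", freq="min", name="Time")
df = pd.DataFrame({"Place": "home"}, index=idx)

# name: (start, end)
exp_times = {
    "A": ("2021-01-01 00:00", "2021-01-01 00:02"),
    "J": ("2021-01-31 23:57", "2021-01-31 23:59"),
}

# Start with an empty column so rows outside any experiment stay missing
df["experiment"] = None

# String slices on a DatetimeIndex include both endpoints
for name, (start_time, end_time) in exp_times.items():
    df.loc[start_time:end_time, "experiment"] = name

# Show only the labelled rows
print(df[df["experiment"].notna()])
```

Because each .loc assignment only touches its own slice, adding the other eight experiments is a matter of extending the dict.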

2 Comments

Thank you! The first way is not working for me, but the second way is. By the way, the first snippet in your answer was missing the 's' in list_ends and list_names.
@ahsojai thanks, updated the answer. I think the two solutions differ only in how you organize the data. But I'm glad that at least one of them works for you.
