select data based on datetime in pandas dataframe

Question

I am trying to create a kind of "functional select" that lets users write configuration to select data from pandas DataFrames. However, I ran into some issues that puzzle me.

The following is a simplified example:

>>> import pandas as pd
>>> df = pd.DataFrame({'date': pd.date_range(start='2020-01-01', periods=4), 'val': [1, 2, 3, 4]})
>>> df
        date  val
0 2020-01-01    1
1 2020-01-02    2
2 2020-01-03    3
3 2020-01-04    4

Question 1: Why do I get different results when I apply the function to the column in different ways?

>>> import datetime
>>> bydatetime = lambda x : x == datetime.date(2020, 1, 1)
>>> bydatetime(df['date'])
0    False
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bydatetime) # why does this one work?
0     True
1    False
2    False
3    False
Name: date, dtype: bool

However, if I use NumPy's datetime64 or pandas' Timestamp type to build the lambda function, both forms work.

>>> import numpy as np
>>> bynpdatetime = lambda x : x == np.datetime64('2020-01-01')
>>> bynpdatetime(df['date'])
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bynpdatetime)
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> bypdtimestamp = lambda x : x == pd.Timestamp('2020-01-01')
>>> bypdtimestamp(df['date'])
0     True
1    False
2    False
3    False
Name: date, dtype: bool
>>> df['date'].apply(bypdtimestamp)
0     True
1    False
2    False
3    False
Name: date, dtype: bool

So I fell back to the following simple selection, and datetime.date still didn't work. If datetime.date simply doesn't work, why does df['date'].apply(bydatetime) work?

>>> df[df['date'] == datetime.date(2020, 1, 1)]
Empty DataFrame
Columns: [date, val]
Index: []
>>> df[df['date'] == np.datetime64('2020-01-01')]
        date  val
0 2020-01-01    1
>>> df[df['date'] == pd.Timestamp('2020-01-01')]
        date  val
0 2020-01-01    1

Last but not least, why is the type of the date column datetime64 in the DataFrame, but Timestamp when I select a single cell? What exactly is the difference between them?

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    4 non-null      datetime64[ns]
 1   val     4 non-null      int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 192.0 bytes
>>>
>>> df['date'][0]
Timestamp('2020-01-01 00:00:00')

I am sure there is something fundamental that I don't understand here. Thank you very much for anything constructive.

ALollz · Accepted Answer · 2020-03-27 15:04:53Z

Luckily I have an older version of pandas (0.25), where you get a warning when you do bydatetime(df['date']), and it explains exactly why you see that behavior. There was a bit of back and forth on how to handle this, so whether you see this behavior is highly version-specific:

FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the 'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and 'the values will not compare equal to the 'datetime.date'. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.
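The conversion the warning recommends can be sketched as follows. This is a minimal example on the question's frame; the names `target`, `mask_ts`, `mask_date`, and `selected` are only illustrative. The second option, comparing through `.dt.date`, is an alternative not mentioned in the warning.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2020-01-01", periods=4),
    "val": [1, 2, 3, 4],
})

target = datetime.date(2020, 1, 1)

# Option 1: promote the date to a Timestamp (midnight of that day),
# as the warning text suggests, then compare the whole column at once.
mask_ts = df["date"] == pd.Timestamp(target)

# Option 2: compare calendar dates only. `.dt.date` returns an object
# Series of datetime.date values, so any time-of-day part is ignored.
mask_date = df["date"].dt.date == target

selected = df[mask_ts]
print(selected)
```

Option 1 keeps the column in its native datetime64 form and is usually the better default. Option 2 is mainly useful when the column carries times of day and only the calendar date should match.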

Datetime functionality in pandas is built upon the np.datetime64 and np.timedelta64 dtypes. You should avoid mixing in objects from the standard-library datetime module, because pandas has made certain choices that are inconsistent with the standard library. All of the unintended behavior is because of this.
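For the "functional select" use case in the question, one way to stay on the pandas side is to normalize whatever the user supplies into a pd.Timestamp once, before building the predicate. The sketch below assumes a hypothetical helper named `make_date_selector`; it is not part of pandas.

```python
import datetime

import numpy as np
import pandas as pd


def make_date_selector(value):
    """Hypothetical helper: build a predicate that compares against `value`.

    `value` is converted to a pd.Timestamp up front, so the predicate
    can be called on a whole Series or passed to Series.apply.
    """
    target = pd.Timestamp(value)  # accepts str, date, datetime, np.datetime64
    return lambda x: x == target


df = pd.DataFrame({
    "date": pd.date_range(start="2020-01-01", periods=4),
    "val": [1, 2, 3, 4],
})

raw_inputs = [
    datetime.date(2020, 1, 1),
    datetime.datetime(2020, 1, 1),
    np.datetime64("2020-01-01"),
    "2020-01-01",
]

results = []
for raw in raw_inputs:
    select = make_date_selector(raw)
    whole_column = select(df["date"]).tolist()      # vectorized comparison
    per_cell = df["date"].apply(select).tolist()    # element-by-element comparison
    results.append((whole_column, per_cell))
```

Because the conversion happens inside the helper, the caller's choice of date type no longer decides which comparison path pandas takes.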

To answer the other, unrelated question: datetime64 is like the array type, or the concept. That array (in this case a pd.Series) is made up of scalar values, and pandas hands each one back as a Timestamp object when you select a single cell. This is explained in the documentation.
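The difference between the column dtype and the type of a single cell can be inspected directly. This is a small sketch; the variable names are illustrative, and the exact dtype resolution may vary by pandas version (the question's output shows datetime64[ns]).

```python
import datetime

import numpy as np
import pandas as pd

s = pd.Series(pd.date_range(start="2020-01-01", periods=4), name="date")

column_dtype = s.dtype       # dtype of the whole column (a datetime64 dtype)
cell = s.iloc[0]             # one element, returned as a pd.Timestamp
raw = s.to_numpy()[0]        # the same element as a NumPy scalar

print(column_dtype)
print(type(cell).__name__, type(raw).__name__)

# pd.Timestamp subclasses datetime.datetime, which is why per-cell
# code (such as Series.apply) sees datetime-like objects.
is_datetime_subclass = isinstance(cell, datetime.datetime)
```

Selecting a cell therefore changes the Python type you hold, even though both forms describe the same instant.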

edited Mar 27, 2020 at 15:04

answered Mar 27, 2020 at 14:51

ALollz


1 Comment

ALollz Over a year ago

@SergeBallesta It's definitely one of the biggest hurdles when starting out with pandas, I wish they'd make it more explicit and perhaps add warnings about working with datetime objects. Mostly these things crop up with timezones, but the list continues to get longer with updates like this.
