Query a pandas DataFrame based on index and data columns

Question

I have a dataset that looks like this:

data="""cruiseid  year  station  month  day  date        lat        lon         depth_w  taxon                        count  
        AA8704    1987  1        04     13   13-APR-87   35.85      -75.48      18       Centropages_typicus          75343  
        AA8704    1987  1        04     13   13-APR-87   35.85      -75.48      18       Gastropoda                   0  
        AA8704    1987  1        04     13   13-APR-87   35.85      -75.48      18       Calanus_finmarchicus         2340   
        AA8704    1987  1        07     13   13-JUL-87   35.85      -75.48      18       Acartia_spp.                 5616   
        AA8704    1987  1        07     13   13-JUL-87   35.85      -75.48      18       Metridia_lucens              468    
        AA8704    1987  1        08     13   13-AUG-87   35.85      -75.48      18       Evadne_spp.                  0      
        AA8704    1987  1        08     13   13-AUG-87   35.85      -75.48      18       Salpa                        0      
        AA8704    1987  1        08     13   13-AUG-87   35.85      -75.48      18       Oithona_spp.                 468    
"""
datafile = open('data.txt','w')
datafile.write(data)
datafile.close()

I read it into pandas with:

import datetime as dt
import pandas as pd

parse = lambda x: dt.datetime.strptime(x, '%d-%m-%Y')  # defined but not passed to read_csv
df = pd.read_csv('data.txt', index_col=0, header=0, parse_dates={"Datetime" : [1,3,4]}, skipinitialspace=True, sep=' ', skiprows=0)

How can I generate a subset of this DataFrame containing all the records from April where the taxon is 'Calanus_finmarchicus' or 'Gastropoda'?

I can query the DataFrame where taxon is equal to 'Calanus_finmarchicus' or 'Gastropoda' using:

df[(df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda')]

But I'm having trouble querying by time. Something similar in NumPy would look like this:

import numpy as np
data = np.genfromtxt('data.txt', dtype=[('cruiseid','S6'), ('year','i4'), ('station','i4'), ('month','i4'), ('day','i4'), ('date','S9'), ('lat','f8'), ('lon','f8'), ('depth_w','i8'), ('taxon','S60'), ('count','i8')], skip_header=1)
selection = [np.where((data['taxon']=='Calanus_finmarchicus') | (data['taxon']=='Gastropoda') & ((data['month']==4) | (data['month']==3)))[0]]
data[selection]

Here's a link with a notebook to reproduce the example

alko · Accepted Answer · 2013-11-23 17:28:49Z

5

You can use the month attribute of the datetime index:

>>> df.index.month
array([4, 4, 4, 7, 7, 8, 8, 8], dtype=int32)

>>> df[((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda'))
...        & (df.index.month == 4)]

           cruiseid  station       date    lat    lon  depth_w  \
Datetime
1987-04-13   AA8704        1  13-APR-87  35.85 -75.48       18
1987-04-13   AA8704        1  13-APR-87  35.85 -75.48       18

                           taxon  count  Unnamed: 11
Datetime
1987-04-13            Gastropoda      0          NaN
1987-04-13  Calanus_finmarchicus   2340          NaN
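
For readers who want to try this without the data file, here is a minimal self-contained sketch. The small DataFrame is a hypothetical stand-in for the question's table (only the taxon and count columns, with the index built directly rather than parsed by read_csv), so it illustrates the filtering idea rather than reproducing the exact output above.

```python
import pandas as pd

# Hypothetical miniature of the question's data, indexed by date.
df = pd.DataFrame(
    {
        "taxon": ["Centropages_typicus", "Gastropoda", "Calanus_finmarchicus",
                  "Acartia_spp.", "Oithona_spp."],
        "count": [75343, 0, 2340, 5616, 468],
    },
    index=pd.DatetimeIndex(
        ["1987-04-13", "1987-04-13", "1987-04-13", "1987-07-13", "1987-08-13"],
        name="Datetime",
    ),
)

# Build each condition separately, then combine them.
is_april = df.index.month == 4
is_wanted = (df["taxon"] == "Calanus_finmarchicus") | (df["taxon"] == "Gastropoda")

subset = df[is_wanted & is_april]
print(subset)
```

Naming the two masks first keeps the final expression short and makes the grouping of the conditions explicit.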

edited Nov 23, 2013 at 17:28

answered Nov 23, 2013 at 17:23


1 Comment

Mzzzzzz Over a year ago

What if you have a multi-column index? Can you refer to the columns individually in the filtering expression?
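
One possible approach for a multi-level index, sketched here under assumptions that are not in the thread: each level can be pulled out with Index.get_level_values and used like a column in the mask. The two-level index (Datetime, station) and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical two-level index: date and station number.
idx = pd.MultiIndex.from_arrays(
    [
        pd.to_datetime(["1987-04-13", "1987-04-13", "1987-07-13"]),
        [1, 2, 1],
    ],
    names=["Datetime", "station"],
)
df = pd.DataFrame(
    {"taxon": ["Gastropoda", "Calanus_finmarchicus", "Gastropoda"]},
    index=idx,
)

# Extract each index level by name and filter on it individually.
dates = df.index.get_level_values("Datetime")
stations = df.index.get_level_values("station")

subset = df[(dates.month == 4) & (stations == 1)]
print(subset)
```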

roman · Accepted Answer · 2013-11-23 17:42:36Z

2

As others said, you can use df.index.month to filter by month, but I also suggest using pandas.Series.isin() to check your taxon condition:

>>> df[df.taxon.isin(['Calanus_finmarchicus', 'Gastropoda']) & (df.index.month == 4)]
           cruiseid  station       date    lat    lon  depth_w  \
Datetime                                                         
1987-04-13   AA8704        1  13-APR-87  35.85 -75.48       18   
1987-04-13   AA8704        1  13-APR-87  35.85 -75.48       18   

                           taxon  count  Unnamed: 11  
Datetime                                              
1987-04-13            Gastropoda      0          NaN  
1987-04-13  Calanus_finmarchicus   2340          NaN
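
The same list-based idea can be extended to the months, which matches the NumPy snippet in the question (months 3 or 4). This is only a sketch: the sample rows are hypothetical, and np.isin is used on df.index.month because it accepts both a plain array and an Index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, including one record from March.
df = pd.DataFrame(
    {
        "taxon": ["Gastropoda", "Calanus_finmarchicus", "Acartia_spp.", "Gastropoda"],
        "count": [0, 2340, 5616, 12],
    },
    index=pd.DatetimeIndex(
        ["1987-03-02", "1987-04-13", "1987-04-13", "1987-07-13"],
        name="Datetime",
    ),
)

taxa = ["Calanus_finmarchicus", "Gastropoda"]
months = [3, 4]

# Membership tests on both the column and the index month.
mask = df["taxon"].isin(taxa) & np.isin(df.index.month, months)
subset = df[mask]
print(subset)
```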

answered Nov 23, 2013 at 17:42


Woody Pride · Accepted Answer · 2013-11-23 17:30:17Z

1

Use the month attribute of your index:

df[(df.index.month == 4) & ((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda'))]

answered Nov 23, 2013 at 17:30


epifanio · Accepted Answer · 2013-11-23 17:46:49Z

0

I didn't pay attention to the syntax (bracket order) or to the DataFrame's index attributes. This line gives me what I was looking for:

results = df[((df.taxon == 'Calanus_finmarchicus') | (df.taxon == 'Gastropoda')) & (df.index.month == 4)]
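
To show why the bracket order matters: in Python, & binds more tightly than |, so a | b & c is evaluated as a | (b & c). The short sketch below uses a hypothetical three-row DataFrame to compare the two groupings.

```python
import pandas as pd

# Hypothetical rows: two from April, one from July.
df = pd.DataFrame(
    {"taxon": ["Calanus_finmarchicus", "Gastropoda", "Calanus_finmarchicus"]},
    index=pd.DatetimeIndex(
        ["1987-04-13", "1987-04-13", "1987-07-13"], name="Datetime"
    ),
)

a = df["taxon"] == "Calanus_finmarchicus"
b = df["taxon"] == "Gastropoda"
c = df.index.month == 4

# Without brackets, the month test only applies to the Gastropoda condition.
loose = a | b & c        # same as a | (b & c)
# With brackets, the month test applies to both taxa.
strict = (a | b) & c

print(df[loose])
print(df[strict])
```

With the loose form, the July record for Calanus_finmarchicus passes the filter; with the bracketed form it is excluded.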

answered Nov 23, 2013 at 17:46
