Python dynamic subset in a loop

Question

I have the following dataframe. Multiple county names in a column, and across the table the dates and the values. Pre-recession max is some max that specific County had over a specific time frame (since not every county experienced same drops in values instantaneously). I need to find out the time between the minimum date which is unique to a row and the date when the value rebounded (when the value in the following column after the column with a minimum value became equal or above the pre-recession Max).

I am new to Python, new to stackoverflow, and spent a week researching online with no success.

Dataframe

Final result

The following code works and evaluates all values in df if they are above 51000. The question is: how to subset the df dynamically? Thanks.

df
revcols = df.columns.values.tolist()
revcols.reverse()
tmpdf=tmpdf= df>51000
final=tmpdf[tmpdf.any(axis=1)].idxmax(axis=1)
final

jezrael · Accepted Answer · 2018-08-12 04:45:16Z

Use:

df = df.set_index(['County','Prerecession Max Value'])

a = df.idxmin(axis=1)
m1 = df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0)
m2 = df.sub(df.index.get_level_values(1), axis=0).ge(0)
b = (m1 & m2).idxmax(axis=1)

d = {'Date of Min': a, 'Date of Max':b}
df = df.assign(**d).reset_index()
print (df)
     County  Prerecession Max Value   2007   2008   2009   2010   2011   2012  \
0  County 1                  100000  90000  81000  72900  65610  70000  80000   
1  County 2                   20000  18000  16000  21000  22000  23000  24000   
2  County 3                   10000   9000   8100   7290   6561   5905   6405   
3  County 4                    6000   6000   4860   4374   4474   4574   6001   

    2013    2014    2015 Date of Min Date of Max  
0  90000  100000  110000        2010        2014  
1  25000   26000   27000        2008        2009  
2   6905   12405   13405        2011        2014  
3   7000    7500    7900        2009        2012

Setup: (changed last value of 2007 column to 6000 for test matching after minimal year value)

import pandas as pd

temp=u"""
County;Prerecession Max Value;2007;2008;2009;2010;2011;2012;2013;2014;2015
County 1;100,000;90,000;81,000;72,900;65,610;70,000;80,000;90,000;100,000;110,000
County 2;20,000;18,000;16,000;21,000;22,000;23,000;24,000;25,000;26,000;27,000
County 3;10,000;9,000;8,100;7,290;6,561;5,905;6,405;6,905;12,405;13,405
County 4;6,000;6,000;4,860;4,374;4,474;4,574;6,001;7,000;7,500;7,900"""
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), sep=";", thousands=',')
print (df)
     County  Prerecession Max Value   2007   2008   2009   2010   2011   2012  \
0  County 1                  100000  90000  81000  72900  65610  70000  80000   
1  County 2                   20000  18000  16000  21000  22000  23000  24000   
2  County 3                   10000   9000   8100   7290   6561   5905   6405   
3  County 4                    6000   6000   4860   4374   4474   4574   6001   

    2013    2014    2015  
0  90000  100000  110000  
1  25000   26000   27000  
2   6905   12405   13405  
3   7000    7500    7900

Explanation:

First create MultiIndex with no date columns by DataFrame.set_index:

df = df.set_index(['County','Prerecession Max Value'])
print (df)
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  90000  81000  72900  65610  70000  80000   
County 2 20000                   18000  16000  21000  22000  23000  24000   
County 3 10000                    9000   8100   7290   6561   5905   6405   
County 4 6000                     6000   4860   4374   4474   4574   6001   

                                  2013    2014    2015  
County   Prerecession Max Value                         
County 1 100000                  90000  100000  110000  
County 2 20000                   25000   26000   27000  
County 3 10000                    6905   12405   13405  
County 4 6000                     7000    7500    7900

For minimal Date use DataFrame.idxmin:

print (df.idxmin(axis=1))
County    Prerecession Max Value
County 1  100000                    2010
County 2  20000                     2008
County 3  10000                     2011
County 4  6000                      2009
dtype: object

Then need filter all values after minimal values per rows - first compare by min values by DataFrame.eq:

print (df.eq(df.min(axis=1), axis=0))

                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  False  False  False   True  False  False   
County 2 20000                   False   True  False  False  False  False   
County 3 10000                   False  False  False  False   True  False   
County 4 6000                    False  False   True  False  False  False   

                                  2013   2014   2015  
County   Prerecession Max Value                       
County 1 100000                  False  False  False  
County 2 20000                   False  False  False  
County 3 10000                   False  False  False  
County 4 6000                    False  False  False

Use cumulative sum per rows by DataFrame.cumsum

print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1))
                                 2007  2008  2009  2010  2011  2012  2013  \
County   Prerecession Max Value                                             
County 1 100000                     0     0     0     1     1     1     1   
County 2 20000                      0     1     1     1     1     1     1   
County 3 10000                      0     0     0     0     1     1     1   
County 4 6000                       0     0     1     1     1     1     1   

                                 2014  2015  
County   Prerecession Max Value              
County 1 100000                     1     1  
County 2 20000                      1     1  
County 3 10000                      1     1  
County 4 6000                       1     1

And compare by DataFrame.gt:

print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0))
                                  2007   2008   2009   2010  2011  2012  2013  \
County   Prerecession Max Value                                                 
County 1 100000                  False  False  False   True  True  True  True   
County 2 20000                   False   True   True   True  True  True  True   
County 3 10000                   False  False  False  False  True  True  True   
County 4 6000                    False  False   True   True  True  True  True   

                                 2014  2015  
County   Prerecession Max Value              
County 1 100000                  True  True  
County 2 20000                   True  True  
County 3 10000                   True  True  
County 4 6000                    True  True

Then create another mask - subtract second level of MultiIndex selected by Index.get_level_values and DataFrame.sub:

print (df.index.get_level_values(1))
Int64Index([100000, 20000, 10000, 6000], 
           dtype='int64', name='Prerecession Max Value')

print (df.sub(df.index.get_level_values(1), axis=0))
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                 -10000 -19000 -27100 -34390 -30000 -20000   
County 2 20000                   -2000  -4000   1000   2000   3000   4000   
County 3 10000                   -1000  -1900  -2710  -3439  -4095  -3595   
County 4 6000                        0  -1140  -1626  -1526  -1426      1   

                                  2013  2014   2015  
County   Prerecession Max Value                      
County 1 100000                 -10000     0  10000  
County 2 20000                    5000  6000   7000  
County 3 10000                   -3095  2405   3405  
County 4 6000                     1000  1500   1900

Then compare by DataFrame.ge for >= with 0:

print (df.sub(df.index.get_level_values(1), axis=0).ge(0))
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  False  False  False  False  False  False   
County 2 20000                   False  False   True   True   True   True   
County 3 10000                   False  False  False  False  False  False   
County 4 6000                    True   False  False  False  False   True   

                                  2013  2014  2015  
County   Prerecession Max Value                     
County 1 100000                  False  True  True  
County 2 20000                    True  True  True  
County 3 10000                   False  True  True  
County 4 6000                     True  True  True

Chain both boolean mask by & for AND and get column name of first Trues per row by DataFrame.idxmax:

print ((m1 & m2))
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  False  False  False  False  False  False   
County 2 20000                   False  False   True   True   True   True   
County 3 10000                   False  False  False  False  False  False   
County 4 6000                    False  False  False  False  False   True   

                                  2013  2014  2015  
County   Prerecession Max Value                     
County 1 100000                  False  True  True  
County 2 20000                    True  True  True  
County 3 10000                   False  True  True  
County 4 6000                     True  True  True  

print ((m1 & m2).idxmax(axis=1))
County    Prerecession Max Value
County 1  100000                    2014
County 2  20000                     2009
County 3  10000                     2014
County 4  6000                      2012
dtype: object

Create dictionary of new columns for assign:

d = {'Date of Min': a, 'Date of Max':b}
df = df.assign(**d)
print (df)
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  90000  81000  72900  65610  70000  80000   
County 2 20000                   18000  16000  21000  22000  23000  24000   
County 3 10000                    9000   8100   7290   6561   5905   6405   
County 4 6000                     6000   4860   4374   4474   4574   6001   

                                  2013    2014    2015 Date of Min Date of Max  
County   Prerecession Max Value                                                 
County 1 100000                  90000  100000  110000        2010        2014  
County 2 20000                   25000   26000   27000        2008        2009  
County 3 10000                    6905   12405   13405        2011        2014  
County 4 6000                     7000    7500    7900        2009        2012

And last reset_index for columns from MultiIndex.

Richa Monga · Accepted Answer · 2018-08-12 10:21:42Z

Thanks for posting this question.I have come up with a solution to this problem which has been presented below:

I have created a 'csv' file with sample data provided in the problem statement and named it stack.csv. I have added three new columns in this csv which will hold calculated values for:

MinVal_Year - Year where county has minimum value
Rebound_Year -Year when the value rebounded from pre-recession value
TimeDiff -Time elapsed between minimum value year to rebound year

Initially there are nulls or NaNs in these columns.

Now,we can have a look at the solution I have coded:

#Loading the CSV file into a data frame
df = pd.read_csv('stack.csv')

#Transposing the county and year columns to create a subset in order to fetch minimum value for each year
df_subset=df[['county','2007','2008','2009','2010','2011','2012','2013','2014','2015']]
df_subset_transposed = df_subset.T
df_subset_transposed.rename(columns={0:'county1'}, inplace=True)
df_subset_transposed.rename(columns={1:'county2'}, inplace=True)
df_subset_transposed.rename(columns={2:'county3'}, inplace=True)
df_subset_transposed.rename(columns={3:'county4'}, inplace=True)
df_subset_transposed.drop(['county'],inplace=True)
df_subset_transposed.index.names=['year']
df['MinVal_Year'][df['county']=='county1'] = pd.to_numeric(df_subset_transposed[('county1')]).idxmin() 
df['MinVal_Year'][df['county']=='county2'] = pd.to_numeric(df_subset_transposed[('county2')]).idxmin() 
df['MinVal_Year'][df['county']=='county3'] = pd.to_numeric(df_subset_transposed[('county3')]).idxmin() 
df['MinVal_Year'][df['county']=='county4'] = pd.to_numeric(df_subset_transposed[('county4')]).idxmin() 

#Iterating the main data frame couny wise to fetch which year is the rebound year
j=0
for i in df['county']:
    if df[df['county']==i]['2007'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2007') 
    if df[df['county']==i]['2008'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2008')
    elif df[df['county']==i]['2009'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2009')
    elif df[df['county']==i]['2010'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2010')
    elif df[df['county']==i]['2011'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2011')
    elif df[df['county']==i]['2012'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2012')
    elif df[df['county']==i]['2013'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2013')
    elif df[df['county']==i]['2014'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2014')
    elif df[df['county']==i]['2015'][j] >= df[df['county']==i]['prerecession val'][j]:
        df.set_value(j,'Rebound_Year','2015')
    j+=1

#Calculating the time difference of number of years elapse between year of minimum value and rebound year        
df['TimeDiff']=df['Rebound_Year']-pd.to_numeric(df['MinVal_Year'])

Lets' have look at the key columns in resultant data frame:

df[['county','prerecession val','MinVal_Year','Rebound_Year','TimeDiff']]

Hope this end to end tested solution helps you.

Thank you for trying to solve this problem. My original dataset includes 15K of counties/zips (rows) and monthly values (the number of columns is 250 or so). I had to simplify the task before posting here. My vote goes to jezrael as it worked on the whole dataframe without specifying column names one by one. There probably could be a simple tweak to your solution to pass all column names as a list which as a beginner, I wouldn't know how to do. I am surprised my question was looked at and addressed so quickly. I knew stackoverflow is supportive but didn't expect it to be so good!!
Thank you so much for sharing your feedback. Even I agree with you that my solution can be made better and more generic as it is very specific at this point.There is a lot to learn from jezrael's solution.I will work on making my solution more generic and optimized.Keep posting such questions as it really helps to fine tune one's skills.

Collectives™ on Stack Overflow

Python dynamic subset in a loop

2 Answers 2

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related