Adding missing rows using panda

Question

This question is related to Adding rows per group in pandas / ipython if per group a row is missing, but is a bit more complicated.

I have a table like this:

ID    DEGREE    TERM    STATUS GRADTERM
1     Bachelors 20111   1
1     Bachelors 20116   1
2     Bachelors 20126   1
2     Bachelors 20131   1
2     Bachelors 20141   1
3     Bachelors 20106   1
3     Bachelors 20111   1       20116
3     Masters   20116   1
3     Masters   20121   1
3     Masters   20131   1       20136

What I would like is to turn that into this (when run for term 20151):

ID    DEGREE    TERM    STATUS
1     Bachelors 20111   1
1     Bachelors 20116   1
1     Bachelors 20121   0
1     Bachelors 20126   0
1     Bachelors 20131   0
1     Bachelors 20136   0
1     Bachelors 20141   0
1     Bachelors 20146   0
1     Bachelors 20151   0
2     Bachelors 20126   1
2     Bachelors 20131   1
2     Bachelors 20136   0
2     Bachelors 20141   1
2     Bachelors 20146   0
2     Bachelors 20151   0
3     Bachelors 20106   1
3     Bachelors 20111   1
3     Bachelors 20116   2
3     Bachelors 20121   2
3     Bachelors 20126   2
3     Bachelors 20131   2
3     Bachelors 20136   2
3     Bachelors 20141   2
3     Bachelors 20146   2
3     Bachelors 20151   2
3     Masters   20116   1
3     Masters   20121   1
3     Masters   20126   0
3     Masters   20131   1
3     Masters   20136   2
3     Masters   20141   2
3     Masters   20146   2
3     Masters   20151   2

In each table, STATUS is 0 - Not Enrolled, 1 - Enrolled, and 2 - Graduated. The TERM fields are the year followed by a 1 or 6 for spring or fall.

Missing TERM records should be added for each person between their first record and the current term (which is 20151 in this case). For each added record, assign a STATUS of 0 unless the last existing record has a STATUS of 2 (which carries). That is, a person is enrolled (STATUS=1) or they are not (STATUS=0 or 2).

I am using pandas in Python, but I am new to Python. I have been trying to figure out how the indexing for a DataFrame works, but that is a complete mystery at this point. Any guidance would be greatly appreciated.

Jianxun Li · Accepted Answer · 2015-06-24 16:52:19Z

You can do it this way.

import pandas as pd
# python 3.4 used
import io

# just try to replicate your data. Use your own csv file instead
# =========================================================
csv = 'ID,DEGREE,TERM,STATUS,GRADTERM\n1,Bachelors,20111,1,\n1,Bachelors,20116,1,\n2,Bachelors,20126,1,\n2,Bachelors,20131,1,\n2,Bachelors,20141,1,\n3,Bachelors,20106,1,\n3,Bachelors,20111,1,20116.0\n3,Masters,20116,1,\n3,Masters,20121,1,\n3,Masters,20131,1,20136.0\n'

df = pd.read_csv(io.StringIO(csv)).set_index('ID')
print(df)

       DEGREE   TERM  STATUS  GRADTERM
ID                                    
1   Bachelors  20111       1       NaN
1   Bachelors  20116       1       NaN
2   Bachelors  20126       1       NaN
2   Bachelors  20131       1       NaN
2   Bachelors  20141       1       NaN
3   Bachelors  20106       1       NaN
3   Bachelors  20111       1     20116
3     Masters  20116       1       NaN
3     Masters  20121       1       NaN
3     Masters  20131       1     20136


# two helper functions
# =========================================================

def build_year_term_range(start_term, current_term):
    # assumes start_term current_term in format '20151' alike
    start_year = int(start_term[:4])  # first four are year
    start_term = int(start_term[-1])  # last four is term
    current_year = int(current_term[:4])
    current_term = int(current_term[-1])
    # build a range
    year_rng = np.repeat(np.arange(start_year, current_year+1), 2)
    term_rng = [1, 6] * int(len(year_rng) / 2)
    year_term_rng = [int(str(year) + str(term)) for year, term in zip(year_rng, term_rng)]
    # check whether need to trim the first and last
    if start_term == 6:  # remove the first
        year_term_rng = year_term_rng[1:]
    if current_term == 1:  # remove the last
        year_term_rng = year_term_rng[:-1]

    return year_term_rng

def my_apply_func(group, current_year_term=current_year_term):
    # start of the record 
    start_year_term = str(group['TERM'].iloc[0])  # gives 2001
    year_term_rng = build_year_term_range(start_year_term, current_year_term)
    # manipulate the group
    group = group.reset_index().set_index('TERM')
    # use reindex to populate missing rows
    group = group.reindex(year_term_rng)
    # fillna ID/DEGREE same as previous
    group[['ID', 'DEGREE']] = group[['ID', 'DEGREE']].fillna(method='ffill')  
    # fillna by 0 not enrolled (for now)
    group['STATUS'] = group['STATUS'].fillna(0)
    # shift GRADTERM 1 slot forward, because GRADTERM and TERM are not aligned
    group['GRADTERM'] = group['GRADTERM'].shift(1)
    # check whether has been graduate, convert to int, use cumsum to carry that non-zero entry forward, convert back to boolean
    # might seems non-trivial at first place :)
    group.loc[group['GRADTERM'].notnull().astype(int).cumsum().astype(bool), 'STATUS'] = 2
    # return only relevant columns
    return group['STATUS']


# start processing
# ============================================================
# move ID from index to a normal column
df = df.reset_index()
# please specify the current year term in string
current_year_term = '20151'
# assume ID is your index column
result = df.groupby(['ID', 'DEGREE']).apply(my_apply_func).reset_index()

Out[163]: 
    ID     DEGREE   TERM  STATUS
0    1  Bachelors  20111       1
1    1  Bachelors  20116       1
2    1  Bachelors  20121       0
3    1  Bachelors  20126       0
4    1  Bachelors  20131       0
5    1  Bachelors  20136       0
6    1  Bachelors  20141       0
7    1  Bachelors  20146       0
8    1  Bachelors  20151       0
9    2  Bachelors  20126       1
10   2  Bachelors  20131       1
11   2  Bachelors  20136       0
12   2  Bachelors  20141       1
13   2  Bachelors  20146       0
14   2  Bachelors  20151       0
15   3  Bachelors  20106       1
16   3  Bachelors  20111       1
17   3  Bachelors  20116       2
18   3  Bachelors  20121       2
19   3  Bachelors  20126       2
20   3  Bachelors  20131       2
21   3  Bachelors  20136       2
22   3  Bachelors  20141       2
23   3  Bachelors  20146       2
24   3  Bachelors  20151       2
25   3    Masters  20116       1
26   3    Masters  20121       1
27   3    Masters  20126       0
28   3    Masters  20131       1
29   3    Masters  20136       2
30   3    Masters  20141       2
31   3    Masters  20146       2
32   3    Masters  20151       2

That worked! Thanks. Now I want to add a new column to the original table (e.g., cohort_term). I thought I would just add a new line into the my_apply_func (something like group['COHORT'] = group['COHORT'].fillna(method='ffill')) but that did not work. How do I do that?
How would you like to calculate the value for column 'COHORT'? or just leave it blank? You are on the right track to modify my_apply_func, just make sure it returns both 'COHORT' column and the 'STATUS' column.
Perfect. That worked. I was able to add in all the other columns in which I was interested (there were actually 3 other variables not mentioned above). Your comments are very helpful in understanding how this works. Thanks!

Collectives™ on Stack Overflow

Adding missing rows using panda

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related