I used the pandas groupby method to group my data by id and month, which gives a CSV like this example:

| id | month | average tree growth (cm)|
|----|-------|-------------------------|
|  1 |   4   |        9                |
|  1 |   5   |        4                |
|  1 |   6   |        7                |
|  2 |   1   |        9                |
|  2 |   2   |        9                |
|  2 |   3   |        8                |
|  2 |   4   |        6                |

However, each id should have all 12 months, so I need to add the missing months with the average tree growth set to a null value, like this:

| id | month | average tree growth (cm)|
|----|-------|-------------------------|
|  1 |   1   |        nan              |
|  1 |   2   |        nan              |
|  1 |   3   |        nan              |
|  1 |   4   |        9                |
|  1 |   5   |        4                |
|  1 |   6   |        7                |
|  1 |   7   |        nan              |
|  1 |   8   |        nan              |
|  1 |   9   |        nan              |
|  1 |   10  |        nan              |
|  1 |   11  |        nan              |
|  1 |   12  |        nan              |
|  2 |   1   |        9                |

This is for plotting with bokeh. How do I add the missing months for each id and set the average tree growth to NaN in Python? Is there an easier way than looping over every id and checking its months? Any hint would be appreciated!

2 Answers

One way to do it is to build a MultiIndex of every (id, month) pair with pd.MultiIndex.from_product and then reindex the data on it with .reindex(), as follows:

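import numpy as np
import pandas as pd

# Build every (id, month) pair: each id crossed with months 1 to 12.
mux = pd.MultiIndex.from_product([df['id'].unique(), np.arange(1, 13)],
                                 names=['id', 'month'])
# Reindex on those pairs; the missing ones become rows of NaN.
df.set_index(['id', 'month']).reindex(mux).reset_index()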

Result:

    id  month  average tree growth (cm)
0    1      1                       NaN
1    1      2                       NaN
2    1      3                       NaN
3    1      4                       9.0
4    1      5                       4.0
5    1      6                       7.0
6    1      7                       NaN
7    1      8                       NaN
8    1      9                       NaN
9    1     10                       NaN
10   1     11                       NaN
11   1     12                       NaN
12   2      1                       9.0
13   2      2                       9.0
14   2      3                       8.0
15   2      4                       6.0
16   2      5                       NaN
17   2      6                       NaN
18   2      7                       NaN
19   2      8                       NaN
20   2      9                       NaN
21   2     10                       NaN
22   2     11                       NaN
23   2     12                       NaN
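
For anyone who wants to run this end to end, here is a minimal sketch that builds the sample data from the question inline instead of reading a CSV. The name `full` is only illustrative, and the sketch assumes each (id, month) pair appears at most once, which should hold after grouping by both columns.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built inline instead of read from a CSV.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "month": [4, 5, 6, 1, 2, 3, 4],
    "average tree growth (cm)": [9, 4, 7, 9, 9, 8, 6],
})

# Every (id, month) pair that should exist: each id crossed with months 1..12.
mux = pd.MultiIndex.from_product(
    [df["id"].unique(), np.arange(1, 13)], names=["id", "month"]
)

# Reindexing inserts the missing pairs; their values default to NaN.
full = df.set_index(["id", "month"]).reindex(mux).reset_index()

print(full.shape)                                          # (24, 3)
print(int(full["average tree growth (cm)"].isna().sum()))  # 17
```

If the frame still contains duplicate (id, month) pairs, reindex will raise an error, so aggregate first.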

5 Comments

Thank you so much! But do you know why this error occurs? 'Series' object has no attribute 'stack'. I created a new csv after grouping the data by id and month. When I read this new csv and apply the method you gave, it raises the above error.
@YangZiqi Which of the two solutions above did you try? Did you use unstack before stack, as shown above?
@YangZiqi Though I don't understand why you got the error using my solution (I tested it with your sample data without problems), I have provided another solution in my edit above. It should run very fast since it involves only one simple step of reindexing the row index, without multiple steps of grouping or reformatting the dataframe. Take a look.
Thank you, your solution looked very reasonable to me. I think I messed up the csv I read while using your method. It works now. Thank you so much!
@YangZiqi Great that it works for you now. Make good use of my updated solution. As I mentioned, it achieves this specific task directly, without unnecessary grouping or reformatting of the dataframe. Hence, it is more efficient.

One possible solution is the following:

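import numpy as np

# For each id, generate months 1 to 12, then left-merge the original rows back in.
(df.groupby('id')['month']
   .apply(lambda x: np.arange(1, 13))
   .explode()
   .reset_index()
   .merge(df, how='left')
)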

which produces:

    id month  average tree growth (cm)
0    1     1                       NaN
1    1     2                       NaN
2    1     3                       NaN
3    1     4                       9.0
4    1     5                       4.0
5    1     6                       7.0
6    1     7                       NaN
7    1     8                       NaN
8    1     9                       NaN
9    1    10                       NaN
10   1    11                       NaN
11   1    12                       NaN
12   2     1                       9.0
13   2     2                       9.0
14   2     3                       8.0
15   2     4                       6.0
16   2     5                       NaN
17   2     6                       NaN
18   2     7                       NaN
19   2     8                       NaN
20   2     9                       NaN
21   2    10                       NaN
22   2    11                       NaN
23   2    12                       NaN
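
The same id-by-month grid can also be built with a cross join rather than groupby and explode. This is a variant sketch, not the code above: it assumes pandas 1.2 or later (for `how='cross'`), uses the question's sample data built inline, and the names `ids`, `months`, `grid`, and `out` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built inline instead of read from a CSV.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "month": [4, 5, 6, 1, 2, 3, 4],
    "average tree growth (cm)": [9, 4, 7, 9, 9, 8, 6],
})

# One row per distinct id, and one row per month 1..12.
ids = df[["id"]].drop_duplicates()
months = pd.DataFrame({"month": np.arange(1, 13)})

# Cross join: every id paired with every month (requires pandas >= 1.2).
grid = ids.merge(months, how="cross")

# Left merge keeps every grid row; pairs absent from df get NaN.
out = grid.merge(df, on=["id", "month"], how="left")

print(out.shape)  # (24, 3)
```

One difference from the explode version is that the month column keeps an integer dtype here, since it never passes through an object-typed Series.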

1 Comment

Don't forget to mark the answer you accept as accepted. This way the question disappears from the list of unanswered questions.
