1

Suppose I have a dataframe like this:

country year value
A 2008 1
A 2011 1
B 2008 1
B 2011 1

I want to add missing year per country, in this case 2009 and 2010, with desired output like this:

country year value
A 2008 1
A 2009
A 2010
A 2011 1
B 2008 1
B 2009
B 2010
B 2011 1

How can I do that? Thanks in advance!

2
  • Will it always have two records? Or may we expect more for a country? Commented Sep 3, 2022 at 5:38
  • I expect more country Commented Sep 9, 2022 at 10:47

4 Answers 4

3

First let's create your dataset for the MCVE sake:

import pandas as pd

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 1},
    {"country": "B", "year": 2008, "value": 1},
    {"country": "B", "year": 2011, "value": 1},
])

Then we create the missing data by ruling from min(year) to max(year):

extension = frame.groupby("country")["year"].agg(["min", "max"]).reset_index()
extension["year"] = extension.apply(lambda x: list(range(x["min"], x["max"] + 1)), axis=1)

#   country   min   max                      year
# 0       A  2008  2011  [2008, 2009, 2010, 2011]
# 1       B  2008  2011  [2008, 2009, 2010, 2011]

Exploding the structure gives the correct format but without values:

extension = extension.explode("year")[["country", "year"]]
extension["year"] = extension["year"].astype(int)

#   country  year
# 0       A  2008
# 0       A  2009
# 0       A  2010
# 0       A  2011
# 1       B  2008
# 1       B  2009
# 1       B  2010
# 1       B  2011

Then we merge back with the original data to get the values:

results = frame.merge(extension, how="right", on=["country", "year"])

#   country  year  value
# 0       A  2008    1.0
# 1       A  2009    NaN
# 2       A  2010    NaN
# 3       A  2011    1.0
# 4       B  2008    1.0
# 5       B  2009    NaN
# 6       B  2010    NaN
# 7       B  2011    1.0

The advantage of this method - in addition of being purely pandas - is that it is robust against data variation:

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 2},
    {"country": "B", "year": 2005, "value": 1},
    {"country": "B", "year": 2009, "value": 2},
    {"country": "C", "year": 2008, "value": 1},
    {"country": "C", "year": 2010, "value": 2},
    {"country": "C", "year": 2012, "value": 3},
])

#    country  year  value
# 0        A  2008    1.0
# 1        A  2009    NaN
# 2        A  2010    NaN
# 3        A  2011    2.0
# 4        B  2005    1.0
# 5        B  2006    NaN
# 6        B  2007    NaN
# 7        B  2008    NaN
# 8        B  2009    2.0
# 9        C  2008    1.0
# 10       C  2009    NaN
# 11       C  2010    2.0
# 12       C  2011    NaN
# 13       C  2012    3.0
Sign up to request clarification or add additional context in comments.

Comments

0

Let's create a dataframe first as follows :

import pandas as pd
data = {'country' : ['A', 'A', 'B', 'B'], 
        'year' : ['2008', '2011', '2008', '2011'], 
        'value':[1,1,1,1]}
df = pd.DataFrame(data=data)

Created dataset :

  country  year  value
0       A  2008      1
1       A  2011      1
2       B  2008      1
3       B  2011      1

Lets define the years we need to consider :

yr_list = ['2008', '2009', '2010', '2011']

Lets modify the dataset based on our requirement :

for country in df['country'].unique() : 
  for yr in yr_list :
    if yr not in list(df.loc[df['country'] == country, 'year']): 
      update_data = {'country' : country, 'year' : yr}
      df = df.append(update_data, ignore_index = True)

final_df = df.sort_values(by = ['country', 'year'],ignore_index=True)
print(final_df) 

The final output :

  country  year  value
0       A  2008    1.0
1       A  2009    NaN
2       A  2010    NaN
3       A  2011    1.0
4       B  2008    1.0
5       B  2009    NaN
6       B  2010    NaN
7       B  2011    1.0

1 Comment

It looses the benefit of using dataframe when writing explicit for loops to process the data. Boolean indexing is a great capability, but here it will slow down the process because it is nested in two for loops of cardinality #years x #countries. Also it requires to know year list in advance and hard code it. What if years are different for countries?
0

One option is with the complete function from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

Create a dictionary, with an anonymous function, containing all possible years:

new_years = {'year': lambda year: range(year.min(), year.max() + 1)}

Use the dictionary within complete, with the by parameter, so it is applied per group:

df.complete(new_years, by = 'country')
  country  year  value
0       A  2008    1.0
1       A  2009    NaN
2       A  2010    NaN
3       A  2011    1.0
4       B  2008    1.0
5       B  2009    NaN
6       B  2010    NaN
7       B  2011    1.0

Comments

-1
arr1 = [['A', 2008, 1],['A', 2011, 1],['B', 2008, 1],['B', 2011, 1]]

arr2 = [['A', 2008, 1],['A', 2009, None],['A', 2010, None],à['A', 2011, 1],['B', 2008, 1],['B', 2009, None],['B', 2010, None],['B', 2011, 1]]

for elm in arr2:
    if elm not in arr1:
        arr1.append(elm)

2 Comments

This is the same as manually encoding the data
This doesn't answer the question at all

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.