Python pandas: how to create a new row based on missing value from a column?

Question

Suppose I have a dataframe like this:

country	year	value
A	2008	1
A	2011	1
B	2008	1
B	2011	1

I want to add the missing years per country, in this case 2009 and 2010, so that the output looks like this:

country	year	value
A	2008	1
A	2009
A	2010
A	2011	1
B	2008	1
B	2009
B	2010
B	2011	1

How can I do that? Thanks in advance!

Will each country always have exactly two records, or may we expect more?
– jlandercy, Commented Sep 3, 2022 at 5:38

jlandercy · Accepted Answer · 2022-09-03 06:18:10Z

First, let's create your dataset for the sake of an MCVE:

import pandas as pd

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 1},
    {"country": "B", "year": 2008, "value": 1},
    {"country": "B", "year": 2011, "value": 1},
])

Then we build the missing rows by generating, for each country, every year from min(year) to max(year):

extension = frame.groupby("country")["year"].agg(["min", "max"]).reset_index()
extension["year"] = extension.apply(lambda x: list(range(x["min"], x["max"] + 1)), axis=1)

#   country   min   max                      year
# 0       A  2008  2011  [2008, 2009, 2010, 2011]
# 1       B  2008  2011  [2008, 2009, 2010, 2011]

Exploding the structure gives the correct format but without values:

extension = extension.explode("year")[["country", "year"]]
extension["year"] = extension["year"].astype(int)

#   country  year
# 0       A  2008
# 0       A  2009
# 0       A  2010
# 0       A  2011
# 1       B  2008
# 1       B  2009
# 1       B  2010
# 1       B  2011

Then we merge back with the original data to get the values:

results = frame.merge(extension, how="right", on=["country", "year"])

#   country  year  value
# 0       A  2008    1.0
# 1       A  2009    NaN
# 2       A  2010    NaN
# 3       A  2011    1.0
# 4       B  2008    1.0
# 5       B  2009    NaN
# 6       B  2010    NaN
# 7       B  2011    1.0
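
Note that `value` comes back as float (1.0 rather than 1) because NaN cannot be stored in a plain integer column. If integers with blanks are preferred, as in the desired output of the question, one option is the nullable `Int64` dtype. A minimal sketch, assuming the values are whole numbers; the `merged` frame is a hand-built stand-in for the first rows of `results`:

```python
import pandas as pd

# Stand-in for the merge result above (assumption: values are whole numbers)
merged = pd.DataFrame({
    "country": ["A", "A", "A", "A"],
    "year": [2008, 2009, 2010, 2011],
    "value": [1.0, None, None, 1.0],
})

# Nullable integer dtype: missing entries become <NA> instead of NaN
merged["value"] = merged["value"].astype("Int64")
print(merged)
```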

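The advantage of this method, in addition to being pure pandas, is that it is robust to variation in the data. Running the same steps on the frame below, where the countries have different year ranges and record counts, gives the output that follows: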

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 2},
    {"country": "B", "year": 2005, "value": 1},
    {"country": "B", "year": 2009, "value": 2},
    {"country": "C", "year": 2008, "value": 1},
    {"country": "C", "year": 2010, "value": 2},
    {"country": "C", "year": 2012, "value": 3},
])

#    country  year  value
# 0        A  2008    1.0
# 1        A  2009    NaN
# 2        A  2010    NaN
# 3        A  2011    2.0
# 4        B  2005    1.0
# 5        B  2006    NaN
# 6        B  2007    NaN
# 7        B  2008    NaN
# 8        B  2009    2.0
# 9        C  2008    1.0
# 10       C  2009    NaN
# 11       C  2010    2.0
# 12       C  2011    NaN
# 13       C  2012    3.0
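
For reuse, the steps above could be wrapped in a single function. This is only a sketch: the name `fill_missing_years` and its parameters `group_col` and `year_col` are hypothetical, and it assumes integer years with at most one row per country and year.

```python
import pandas as pd

def fill_missing_years(frame, group_col="country", year_col="year"):
    """Return frame with one row per year between each group's min and max year."""
    # Per-group bounds of the year column
    bounds = frame.groupby(group_col)[year_col].agg(["min", "max"]).reset_index()
    # One list of consecutive years per group
    bounds[year_col] = [
        list(range(lo, hi + 1)) for lo, hi in zip(bounds["min"], bounds["max"])
    ]
    # One row per (group, year), with the year cast back to its original dtype
    extension = bounds.explode(year_col)[[group_col, year_col]].astype(
        {year_col: frame[year_col].dtype}
    )
    # The right merge keeps every (group, year) pair; absent values become NaN
    return frame.merge(extension, how="right", on=[group_col, year_col])

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 2},
    {"country": "B", "year": 2005, "value": 1},
    {"country": "B", "year": 2009, "value": 2},
    {"country": "C", "year": 2008, "value": 1},
    {"country": "C", "year": 2010, "value": 2},
    {"country": "C", "year": 2012, "value": 3},
])
results = fill_missing_years(frame)
print(results)
```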

Kovarthanan Kesavan · Accepted Answer · 2022-09-03 06:07:04Z

0

Let's first create a dataframe as follows:

import pandas as pd
data = {'country' : ['A', 'A', 'B', 'B'], 
        'year' : ['2008', '2011', '2008', '2011'], 
        'value':[1,1,1,1]}
df = pd.DataFrame(data=data)

The created dataset:

  country  year  value
0       A  2008      1
1       A  2011      1
2       B  2008      1
3       B  2011      1

Let's define the years we need to consider:

yr_list = ['2008', '2009', '2010', '2011']

Let's modify the dataset to meet the requirement:

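# DataFrame.append was removed in pandas 2.0, so collect the missing rows
# and concatenate them once at the end.
missing_rows = []
for country in df['country'].unique():
    present = set(df.loc[df['country'] == country, 'year'])
    for yr in yr_list:
        if yr not in present:
            missing_rows.append({'country': country, 'year': yr})
df = pd.concat([df, pd.DataFrame(missing_rows)], ignore_index=True)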

final_df = df.sort_values(by = ['country', 'year'],ignore_index=True)
print(final_df)

The final output:

  country  year  value
0       A  2008    1.0
1       A  2009    NaN
2       A  2010    NaN
3       A  2011    1.0
4       B  2008    1.0
5       B  2009    NaN
6       B  2010    NaN
7       B  2011    1.0
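
The same result can be reached without explicit loops by reindexing against every (country, year) combination. This is a sketch rather than part of the answer above; it assumes the same hard-coded `yr_list` applies to every country and that each (country, year) pair appears at most once.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": ["2008", "2011", "2008", "2011"],
    "value": [1, 1, 1, 1],
})
yr_list = ["2008", "2009", "2010", "2011"]

# Every (country, year) combination that should exist in the result
full_index = pd.MultiIndex.from_product(
    [df["country"].unique(), yr_list], names=["country", "year"]
)

# Reindexing inserts the missing combinations with NaN in `value`
final_df = df.set_index(["country", "year"]).reindex(full_index).reset_index()
print(final_df)
```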

edited Sep 3, 2022 at 6:07

answered Sep 3, 2022 at 5:59

Kovarthanan Kesavan


1 Comment

jlandercy Over a year ago

Writing explicit for loops to process the data loses the benefit of using a dataframe. Boolean indexing is a great capability, but here it slows the process down because it is nested inside two for loops of cardinality #years x #countries. It also requires knowing the list of years in advance and hard-coding it. What if the years differ between countries?

sammywemmy · Accepted Answer · 2022-09-25 10:13:03Z

0

One option is with the complete function from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

Create a dictionary that maps the column name to an anonymous function returning all possible years:

new_years = {'year': lambda year: range(year.min(), year.max() + 1)}

Use the dictionary within complete, with the by parameter, so it is applied per group:

df.complete(new_years, by = 'country')
  country  year  value
0       A  2008    1.0
1       A  2009    NaN
2       A  2010    NaN
3       A  2011    1.0
4       B  2008    1.0
5       B  2009    NaN
6       B  2010    NaN
7       B  2011    1.0

answered Sep 25, 2022 at 10:13

sammywemmy



cottontail · Accepted Answer · 2022-09-03 05:25:39Z

-1

arr1 = [['A', 2008, 1],['A', 2011, 1],['B', 2008, 1],['B', 2011, 1]]

arr2 = [['A', 2008, 1],['A', 2009, None],['A', 2010, None],['A', 2011, 1],['B', 2008, 1],['B', 2009, None],['B', 2010, None],['B', 2011, 1]]

for elm in arr2:
    if elm not in arr1:
        arr1.append(elm)

edited Sep 3, 2022 at 5:25

cottontail


answered Sep 3, 2022 at 4:44

Mohamed Karim Mamlouk


2 Comments

jlandercy Over a year ago

This is the same as manually encoding the data

Derek O Over a year ago

This doesn't answer the question at all
