Update missing values in a column using pandas

Question

I have a dataframe df with two of the columns being 'city' and 'zip_code':

import pandas as pd

df = pd.DataFrame({'city': ['Cambridge', 'Washington', 'Miami', 'Cambridge', 'Miami', 'Washington'],
                   'zip_code': ['12345', '67891', '23457', '', '', '']})

As shown above, a city has a zip code in one row, but the zip_code is missing for the same city in other rows. I want to fill those missing values using the zip_code of that city from another row. In other words, wherever a zip_code is missing, look up the zip_code for that city in the other rows and, if one is found, fill it in. If none is found, fill in 'NA'.

How do I accomplish this task using pandas?

Colonel Beauvel · Accepted Answer · 2016-10-28 08:47:01Z

1

You can go for:

import numpy as np

# Treat empty strings as missing, then forward- and backward-fill within each city
zips = df['zip_code'].replace('', np.nan)
df['zip_code'] = zips.groupby(df['city']).transform(lambda s: s.ffill().bfill())

>>> df
         city zip_code
0   Cambridge    12345
1  Washington    67891
2       Miami    23457
3   Cambridge    12345
4       Miami    23457
5  Washington    67891
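
The question also asks for 'NA' when a city has no zip code in any row, which the output above does not exercise. A minimal sketch of one way to cover that case, assuming empty strings mark missing values and using a hypothetical extra city ('Boston') that is not in the original data:

```python
import pandas as pd

# Sample data from the question plus a hypothetical city with no zip code at all.
df = pd.DataFrame({
    'city': ['Cambridge', 'Washington', 'Miami', 'Cambridge', 'Miami',
             'Washington', 'Boston'],
    'zip_code': ['12345', '67891', '23457', '', '', '', ''],
})

# Turn empty strings into NaN so pandas treats them as missing.
zips = df['zip_code'].mask(df['zip_code'] == '')

# First non-missing zip code per city, broadcast back to every row of that city.
first_per_city = zips.groupby(df['city']).transform('first')

# Keep existing values, fill gaps from the per-city value, fall back to 'NA'.
df['zip_code'] = zips.fillna(first_per_city).fillna('NA')
print(df)
```

Existing zip codes are left untouched here; only the gaps are filled, and a city with conflicting zip codes simply gets the first one seen.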

edited Oct 28, 2016 at 8:47

answered Oct 28, 2016 at 8:40

Colonel Beauvel


5 Comments

ComplexData Over a year ago

TypeError: cannot use label indexing with a null key

João Almeida Over a year ago

This answer is better than EdChum's as it doesn't raise an error in cases where a city has different zip_codes. It just chooses the first one.

EdChum Over a year ago

@JoãoAlmeida drop_duplicates will take the first duplicate so the behaviour should be the same as this answer

João Almeida Over a year ago

@EdChum only if the entire row is a duplicate. I was talking about one city having two different zip codes, for instance Miami 23357 and Miami 23457. This would only happen with incorrect data, but it could happen.

EdChum Over a year ago

@JoãoAlmeida you can pass the subset='city' param to drop_duplicates, which would then take the first entry for each city

EdChum · Accepted Answer · 2016-10-28 09:38:30Z

1

You can find the rows with an empty zip_code by checking the string length with str.len. For those rows, filter the main df down to the rows with valid zip_codes, set 'city' as the index, and call map on the 'city' column, which performs the lookup and fills in the values:

In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df

Out[255]:
         city zip_code
0   Cambridge    12345
1  Washington    67891
2       Miami    23457
3   Cambridge    12345
4       Miami    23457
5  Washington    67891

If your real data has the same city repeated across several rows with valid zip_codes, you'll need to additionally call drop_duplicates first:

df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])

This is needed because map raises an error if the lookup Series has duplicate index entries.
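
To make the duplicate-index problem concrete, here is a small sketch on hypothetical data (not from the question) where 'Miami' has a valid zip code in two rows. The exact exception type may differ between pandas versions, so the example catches a generic Exception:

```python
import pandas as pd

# Hypothetical data: 'Miami' appears twice with a valid zip code.
df = pd.DataFrame({
    'city': ['Miami', 'Miami', 'Cambridge', 'Miami'],
    'zip_code': ['23457', '23457', '12345', ''],
})

valid = df[df['zip_code'].str.len() == 5]

# The lookup Series has 'Miami' twice in its index, so map cannot resolve it.
try:
    df['city'].map(valid.set_index('city')['zip_code'])
    raised = False
except Exception as exc:  # exact exception class depends on the pandas version
    raised = True
    print(type(exc).__name__, exc)

# Keeping one row per city gives a unique index, so the lookup works.
lookup = valid.drop_duplicates(subset='city').set_index('city')['zip_code']
missing = df['zip_code'].str.len() == 0
df.loc[missing, 'zip_code'] = df['city'].map(lookup)
print(df)
```

drop_duplicates(subset='city') keeps the first row for each city, so a city with conflicting zip codes resolves to whichever one appears first.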

edited Oct 28, 2016 at 9:38

answered Oct 28, 2016 at 8:38

EdChum


4 Comments

ComplexData Over a year ago

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

EdChum Over a year ago

This means you may have NaN values rather than blank strings. If so, you can replace these first with df['zip_code'] = df['zip_code'].fillna('') and then the code should work.

EdChum Over a year ago

What is the dtype of that column? Is it numeric or str? In your question the zip_codes are strings; if your real data is not, then you need to post a representative example in your question.

EdChum Over a year ago

Well then, your actual question bears little resemblance to your posted sample df, which basically wastes the community's time. Basically you can do this:

df.loc[df['zip_code'].isnull(), 'zip_code'] = df['city'].map(df[df['zip_code'].notnull()].drop_duplicates(subset='city').set_index('city')['zip_code'])

João Almeida · Accepted Answer · 2016-10-28 08:37:49Z

0

My suggestion would be to first create a dictionary that maps from the city to the zip code. You can create this dictionary from the DataFrame itself.

And then you use that dictionary to fill in all missing zip code values.
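
A minimal sketch of that idea, assuming missing values are either empty strings or NaN and that the first zip code seen for a city is the one to keep (the names missing, known and city_to_zip are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Cambridge', 'Washington', 'Miami', 'Cambridge', 'Miami',
             'Washington'],
    'zip_code': ['12345', '67891', '23457', '', '', ''],
})

# Rows that count as missing: NaN or an empty string.
missing = df['zip_code'].isna() | (df['zip_code'] == '')

# Build {city: zip_code} from rows that have a value, keeping the first per city.
known = df[~missing].drop_duplicates(subset='city')
city_to_zip = dict(zip(known['city'], known['zip_code']))

# Look up each missing row's city; cities not in the dictionary get 'NA'.
df.loc[missing, 'zip_code'] = df.loc[missing, 'city'].map(city_to_zip).fillna('NA')
print(df)
```

Mapping with a plain dictionary returns NaN for cities it does not contain, which is what lets the final fillna('NA') handle the "not found" case from the question.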

answered Oct 28, 2016 at 8:37

João Almeida
