
I have a dataframe df with two of the columns being 'city' and 'zip_code':

df = pd.DataFrame({'city': ['Cambridge','Washington','Miami','Cambridge','Miami',
'Washington'], 'zip_code': ['12345','67891','23457','','','']})

As shown above, a city has a zip code in one row, but the zip_code is missing for the same city in another row. I want to fill those missing values using the zip_code of that city from another row. That is, wherever zip_code is missing, look up the zip_code for that city in the other rows and, if one is found, use it. If none is found, fill in 'NA'.

How do I accomplish this task using pandas?

3 Answers


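You can replace the empty strings with NaN, then forward-fill and back-fill within each city group (both fills have to happen inside the group, otherwise the back-fill can pull a zip code from a different city):

import numpy as np

df['zip_code'] = df['zip_code'].replace('', np.nan).groupby(df['city']).transform(lambda s: s.ffill().bfill())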

>>> df
         city zip_code
0   Cambridge    12345
1  Washington    67891
2       Miami    23457
3   Cambridge    12345
4       Miami    23457
5  Washington    67891

5 Comments

TypeError: cannot use label indexing with a null key
This answer is better than EdChum's as it doesn't give an error on cases where you have cities with different zip_codes. It just chooses the first one.
@JoãoAlmeida drop_duplicates will take the first duplicate so the behaviour should be the same as this answer
@EdChum only if the entire row is a duplicate. I was talking about having one city with two different zip codes, for instance: Miami 23357, Miami 23457. This would only happen with incorrect data, but it could happen.
@JoãoAlmeida you can pass subset='city' param to drop_duplicates which would then take the first entry

You can find the rows with an empty zip_code by checking the string length with str.len. To fill them, filter df down to the rows with a valid zip_code, set 'city' as the index of that subset, and call map on the 'city' column, which looks up each city in that index and returns the matching zip_code:

In [255]:
df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].set_index('city')['zip_code'])
df

Out[255]:
         city zip_code
0   Cambridge    12345
1  Washington    67891
2       Miami    23457
3   Cambridge    12345
4       Miami    23457
5  Washington    67891

If your real data has lots of repeating values then you'll need to additionally call drop_duplicates first:

df.loc[df['zip_code'].str.len() == 0, 'zip_code'] = df['city'].map(df[df['zip_code'].str.len() == 5].drop_duplicates(subset='city').set_index('city')['zip_code'])

This is needed because map raises an error if the Series used for the lookup has duplicate index entries.
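As a rough sketch of how the pieces above fit together, the example below builds the lookup with drop_duplicates and also fills 'NA' for a city that has no known zip code, as the question asks. The sample data is an assumption made for illustration: the second Miami zip code comes from the comments on the other answer, and 'Boston' is a made-up city with no zip code anywhere in the frame.

```python
import pandas as pd

# Hypothetical data: Miami has two conflicting zip codes, Boston has none.
df = pd.DataFrame({
    'city': ['Cambridge', 'Miami', 'Miami', 'Cambridge', 'Miami', 'Boston'],
    'zip_code': ['12345', '23357', '23457', '', '', ''],
})

missing = df['zip_code'].str.len() == 0

# Keep the first known zip_code per city so the lookup index is unique.
lookup = (
    df[~missing]
    .drop_duplicates(subset='city')
    .set_index('city')['zip_code']
)

# Cities absent from the lookup map to NaN; replace those with 'NA'.
df.loc[missing, 'zip_code'] = df.loc[missing, 'city'].map(lookup).fillna('NA')
print(df)
```

With conflicting entries, the first zip code seen for a city wins, so whether that is acceptable depends on how trustworthy the data is.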

4 Comments

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This means you may have NaN values rather than blank strings. If so, you can replace these first with df['zip_code'] = df['zip_code'].fillna('') and then the code should work.
What is the dtype for that column? Is it numeric or is it str? In your question the zip_codes are string if your real data is not then you need to post a representative example in your question
Well then your actual question bears little resemblance to your posted sample df then which basically wastes the community's time. Basically you can do this: df.loc[df['zip_code'].isnull(), 'zip_code'] = df['city'].map(df[df['zip_code'].notnull()].drop_duplicates(subset='city').set_index('city')['zip_code'])

My suggestion would be to first create a dictionary that maps each city to its zip code. You can build this dictionary from the rows of the DataFrame that already have a zip code.

And then you use that dictionary to fill in all missing zip code values.
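A minimal sketch of this idea, using the DataFrame from the question. The names known, city_to_zip and missing are made up for illustration, and the 'NA' fallback follows what the question asks for rather than anything stated in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Cambridge', 'Washington', 'Miami', 'Cambridge', 'Miami', 'Washington'],
    'zip_code': ['12345', '67891', '23457', '', '', ''],
})

# Build the city -> zip_code dictionary from rows that have a zip code.
# If a city appears with several zip codes, the last one seen wins here.
known = df[df['zip_code'] != '']
city_to_zip = dict(zip(known['city'], known['zip_code']))

# Fill only the missing rows; cities not in the dictionary become 'NA'.
missing = df['zip_code'] == ''
df.loc[missing, 'zip_code'] = df.loc[missing, 'city'].map(city_to_zip).fillna('NA')
print(df)
```

Unlike a lookup Series, a dictionary cannot hold duplicate keys, so no drop_duplicates step is needed.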
