
I have 2 dataframes, df1 and df2. df2 is a subset of df1 that I extracted to do some cleaning. Both dataframes can be matched on index. I've seen a lot of merges on the site, but I don't want to add more columns to df1, and the dataframes are not the same size (df1 has 1000 rows and df2 has 275 rows), so I don't want to replace the entire column. I want to update df1['AgeBin'] with the df2['AgeBin'] values wherever the indexes of these dataframes match.

# attempted approach: locate the matching rows in df1, then write df2's values back
indexes = df1.loc[df1.AgeBin.isin(df2.AgeBin.values)].index
df1.at[indexes, 'AgeBin'] = df2['AgeBin'].values

This is what I came up with, but there seems to be an issue since the dataframes are different sizes:

ValueError: Must have equal len keys and value when setting with an iterable

Below is an oversimplification: df1 has 26 columns and df2 has 12 columns, with AgeBin being the last column in both. This is, in essence, what my objective is (a rough runnable sketch of these toy frames follows the tables below):

df2
    AgeBin
0     2 
1     3 
2     1 
3     3 


df1
    AgeBin
0     NaN 
1     NaN 
2     NaN 
3     NaN 

df1 after update
    AgeBin
0     2 
1     3 
2     1 
3     3 
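
Here is a rough, runnable sketch of the toy frames above (simplified, hypothetical values; the real df1 has 26 columns and 1000+ rows):

import numpy as np
import pandas as pd

# simplified stand-ins for the real frames: df1's AgeBin is still empty,
# df2 holds the cleaned values on (a subset of) the same index
df1 = pd.DataFrame({"AgeBin": [np.nan, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({"AgeBin": [2, 3, 1, 3]})

# the attempt from above, written with these names, was roughly:
#   indexes = df1.loc[df1.AgeBin.isin(df2.AgeBin.values)].index
#   df1.at[indexes, 'AgeBin'] = df2['AgeBin'].values
# which breaks when the selected index and df2 have different lengths
print(df1)
print(df2)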

Here are the dataframe specs, first for df1:

RangeIndex: 1309 entries, 0 to 1308
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       1046 non-null   float64 
 1   Survived          714 non-null    category
 2   Pclass            1046 non-null   category
 3   Name              1046 non-null   object  
 4   Sex               1046 non-null   object  
 5   Age               1046 non-null   float64 
 6   SibSp             1046 non-null   float64 
 7   Parch             1046 non-null   float64 
 8   Ticket            1046 non-null   object  
 9   Fare              1046 non-null   float64 
 10  Embarked          1046 non-null   category
 11  FamilySize        1046 non-null   float64 
 12  Surname           1046 non-null   object  
 13  Title             1046 non-null   object  
 14  IsChild           1046 non-null   float64 
 15  isMale            1046 non-null   category
 16  GroupID           1046 non-null   float64 
 17  GroupSize         1046 non-null   float64 
 18  GroupType         1046 non-null   object  
 19  GroupNumSurvived  1046 non-null   float64 
 20  GroupNumPerished  1046 non-null   float64 
 21  LargeGroup        1046 non-null   float64 
 22  SplitFare         1046 non-null   float64 
 23  log10Fare         1046 non-null   float64 
 24  log10SplitFare    1046 non-null   float64 
 25  AgeBin            1046 non-null   category
dtypes: category(5), float64(15), object(6)
memory usage: 221.9+ KB
  

dfageResults.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 5 to 1308
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   AgeBin  263 non-null    category
dtypes: category(1)
memory usage: 12.4 KB

Here are the categories

[67] dfageResults.groupby(["AgeBin"])["AgeBin"].count()
AgeBin
0-14      25
15-29    192
30-44     46
Name: AgeBin, dtype: int64

[68] df.groupby(["AgeBin"])["AgeBin"].count()
AgeBin
0-14     107
15-29    462
30-44    301
45-59    136
60+       40
Name: AgeBin, dtype: int64
  • Could you please provide sample data of the two dataframes? Commented Jul 7, 2020 at 2:13
  • What you need is df1.combine_first(df2) Commented Jul 7, 2020 at 2:30
  • I tried your solution and it did not work. Thanks tho for helping! Commented Jul 7, 2020 at 2:39

2 Answers


Assuming all the indexes in df2 exist in df1 (which I understand is the case), the below will suffice:

df1.loc[df2.index,:]=df2

In case the above assumption about the index doesn't hold, this is the alternative (same result; it only updates indexes that exist in df1):

df1.loc[df2.index.intersection(df1.index), :] = df2

Sample output (with more representative sample data):

import pandas as pd
import numpy as np

df1=pd.DataFrame({"AgeBin": [1,2,3,'x', np.nan,np.nan,'a']})

df2=pd.DataFrame({"AgeBin": ['new1', 'new2', 123]}, index=[5,2,3])

print(df1)
print(df2)
df1.loc[df2.index,:]=df2
print(df1)

Outputs:

  AgeBin
0      1
1      2
2      3
3      x
4    NaN
5    NaN
6      a

  AgeBin
5   new1
2   new2
3    123

  AgeBin
0      1
1      2
2   new2
3    123
4    NaN
5   new1
6      a
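
And a quick sketch of the intersection variant above, for the case where df2 carries labels that df1 does not have (hypothetical toy data):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"AgeBin": [np.nan] * 5})                   # index 0-4
df2 = pd.DataFrame({"AgeBin": [10, 20, 30]}, index=[2, 3, 7])  # label 7 is not in df1

# only the overlapping labels (2 and 3) are written back; label 7 is ignored
df1.loc[df2.index.intersection(df1.index), :] = df2
print(df1)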

9 Comments

This looks great! It also makes sense. But I am getting an error:
ValueError: Cannot set a Categorical with another, without identical categories
I reduced df2 to only 1 column and its index to make this easier. I also changed the df2 column values to categoricals that match df1. Yet I am still getting the error. I am on Python 3.7.7, if that helps.
Remember that df1 has 26 columns; I have reduced df2 to just 1 now.
Try df1.loc[df2.index,'AgeBin']=df2
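
If that single-column form still raises the categorical ValueError, one possible workaround (an assumption here, since the category counts in the question suggest df2's AgeBin categories are a subset of df1's) is to give both columns an identical category set before assigning:

# hypothetical workaround: make df2's category set identical to df1's,
# then do the index-aligned assignment on the single column
df2['AgeBin'] = df2['AgeBin'].cat.set_categories(df1['AgeBin'].cat.categories)
df1.loc[df2.index, 'AgeBin'] = df2['AgeBin']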

Try:

print('df2')
print(df2)

print('\ndf1')
print(df1)

df1.update(df2)

print('\ndf1 after update')
print(df1)

Output:

df2
  AgeBin
0  2    
1  3    
2  1    
3  3    

df1
   AgeBin
0 NaN    
1 NaN    
2 NaN    
3 NaN    

df1 after update
  AgeBin
0  2    
1  3    
2  1    
3  3   
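
Note that update aligns on the index (and on column labels) and only overwrites df1 where df2 has non-NaN values, so differently sized frames are fine. A small sketch with hypothetical data:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"AgeBin": [np.nan] * 6})          # 6 rows, index 0-5
df2 = pd.DataFrame({"AgeBin": [2, 1]}, index=[2, 5])  # 2 rows, a subset of df1's index

df1.update(df2)   # only rows 2 and 5 are overwritten; the rest stay NaN
print(df1)

With the real categorical AgeBin column, the category-mismatch error mentioned in the other answer's comments may still require aligning the category sets first.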

5 Comments

Your output does not match my desired result. Also, I am not sure I follow your logic; how am I supposed to implement your code on 1000+ rows with 275 possible results?
Sorry, I created my own dataset before. I also tested this out on different-sized dataframes and it worked fine.
I see the revised code; I am getting the same error as above:
ValueError: Cannot set a Categorical with another, without identical categories
I apologize; I thought the number of rows was the only thing different in your problem, not the number of columns.
