
I have two dataframes df1 and df2.

import pandas as pd

d = {'ID': [31,42,63,44,45,26], 
     'lat': [64,64,64,64,64,64],
     'lon': [152,152,152,152,152,152],
     'other1': [12,13,14,15,16,17],
     'other2': [21,22,23,24,25,26]}
df1 = pd.DataFrame(data=d)

d2 = {'ID': [27,48,31,45,49,10], 
     'LAT': [63,63,63,63,63,63],
     'LON': [153,153,153,153,153,153]}
df2 = pd.DataFrame(data=d2)

df1 has incorrect values for columns lat and lon, but has correct data in the other columns that I need to keep track of. df2 has correct LAT and LON values but only has a few common IDs with df1. There are two things I would like to accomplish. First, I want to split df1 into two dataframes: df3 which has IDs that are present in df2; and df4 which has everything else. I can get df3 with:

from functools import reduce
import numpy as np

df3 = pd.DataFrame()
for i in reduce(np.intersect1d, [df1.ID, df2.ID]):
    df3 = pd.concat([df3, df1.loc[df1.ID == i]])

but how do I get df4 to be the remaining data?

Second, I want to replace the lat and lon values in df3 with the correct data from df2. I figure there is a slick python way to do something like:

for j in range(len(df3)):
    for k in range(len(df2)):
        if df3.ID[j] == df2.ID[k]:
            df3.lat[j] = df2.LAT[k]
            df3.lon[j] = df2.LON[k]    

But I can't even get the above nested loop working correctly. I don't want to spend a lot of time getting it working if there is a better way to accomplish this in python.

2 Answers


For question 1, you can use boolean indexing:

m = df1.ID.isin(df2.ID)

df3 = df1[m]
df4 = df1[~m]

print(df3)
print(df4)

Prints:

   ID  lat  lon  other1  other2
0  31   64  152      12      21
4  45   64  152      16      25

   ID  lat  lon  other1  other2
1  42   64  152      13      22
2  63   64  152      14      23
3  44   64  152      15      24
5  26   64  152      17      26

For question 2:

x = df3.merge(df2, on="ID")[["ID", "other1", "other2", "LAT", "LON"]]
print(x)

Prints:

   ID  other1  other2  LAT  LON
0  31      12      21   63  153
1  45      16      25   63  153

EDIT: For question 2 you can do:

x = df3.merge(df2, on="ID").drop(columns=["lat", "lon"])
print(x)
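After the merge-and-drop step the coordinate columns keep df2's uppercase names. If df3's original lowercase names should be preserved, a `rename` can be chained on (a sketch using the question's sample data; the rename mapping assumes only `lat`/`lon` differ between the two frames):

```python
import pandas as pd

d = {'ID': [31, 42, 63, 44, 45, 26],
     'lat': [64] * 6, 'lon': [152] * 6,
     'other1': [12, 13, 14, 15, 16, 17],
     'other2': [21, 22, 23, 24, 25, 26]}
df1 = pd.DataFrame(d)
d2 = {'ID': [27, 48, 31, 45, 49, 10],
      'LAT': [63] * 6, 'LON': [153] * 6}
df2 = pd.DataFrame(d2)

df3 = df1[df1.ID.isin(df2.ID)]

# merge, drop the stale coordinates, restore the original names
x = (df3.merge(df2, on="ID")
        .drop(columns=["lat", "lon"])
        .rename(columns={"LAT": "lat", "LON": "lon"}))
```

This keeps all 300+ "other" columns without listing them, and the result has the same schema as df1.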

2 Comments

The above answer works great. Thank you! But on my actual dataset I have 300+ columns, so listing out the columns in x = df3.merge(df2, on="ID")[["ID", "other1", "other2", "LAT", "LON"]] is not practical. Is there a way to keep all other columns without listing them?
@ChloePeterson I've updated my answer. For the last step just drop the columns "lat", "lon". The "LAT", "LON" will have new values.

You can merge with an indicator, prefer the LAT and LON values where they exist and fall back to lat and lon otherwise, then group on the indicator to split the result into a dictionary of dataframes and grab the entries you need:

import numpy as np

u = df1.merge(df2, on='ID', how='left', indicator='I')
u[['LAT','LON']] = np.where(u[['LAT','LON']].isna(), u[['lat','lon']], u[['LAT','LON']])
u = u.drop(columns=['lat','lon'])
u['I'] = np.where(u['I'].eq("left_only"),"left_df","others")
d = dict(iter(u.groupby("I")))

print(d['left_df'],'\n--------\n',d['others'])

   ID  other1  other2   LAT    LON        I
1  42      13      22  64.0  152.0  left_df
2  63      14      23  64.0  152.0  left_df
3  44      15      24  64.0  152.0  left_df
5  26      17      26  64.0  152.0  left_df 
--------
    ID  other1  other2   LAT    LON       I
0  31      12      21  63.0  153.0  others
4  45      16      25  63.0  153.0  others
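The same split-and-overwrite can also be sketched with `set_index` plus `DataFrame.update`, which writes df2's coordinates over df1's for matching IDs and leaves every other column alone (a sketch on the question's sample data, assuming IDs are unique in both frames):

```python
import pandas as pd

d = {'ID': [31, 42, 63, 44, 45, 26],
     'lat': [64] * 6, 'lon': [152] * 6,
     'other1': [12, 13, 14, 15, 16, 17],
     'other2': [21, 22, 23, 24, 25, 26]}
df1 = pd.DataFrame(d)
d2 = {'ID': [27, 48, 31, 45, 49, 10],
      'LAT': [63] * 6, 'LON': [153] * 6}
df2 = pd.DataFrame(d2)

# align df2's coordinates on ID under df1's column names
fixes = (df2.set_index('ID')
            .rename(columns={'LAT': 'lat', 'LON': 'lon'}))

u = df1.set_index('ID')
u.update(fixes)          # overwrites lat/lon only where IDs match
u = u.reset_index()

# then split into the corrected and untouched halves
m = u.ID.isin(df2.ID)
df3, df4 = u[m], u[~m]
```

Note that `update` may upcast the updated columns to float, so compare values rather than dtypes if that matters downstream.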

