Consider the dataframes df1 and df2:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'unique_id': [1, 2, 3],
    'price': [11, 12, 13],
})
df2 = pd.DataFrame({
    'unique_id': [1, 2, 3, 4, 5],
    'price': [9, 10, 11, 12, 13],
})
merge
df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')
   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
join
df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
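A third option, which also appears in the timings further down, is to look the price up with Series.map. The snippet below is a minimal sketch of that lookup, assuming the unique_id values in df2 are unique; the names lookup, mapped and merged are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 2, 3], 'price': [11, 12, 13]})
df2 = pd.DataFrame({'unique_id': [1, 2, 3, 4, 5], 'price': [9, 10, 11, 12, 13]})

# Build an id -> price lookup from df2, then map df1's ids through it.
lookup = df2.set_index('unique_id')['price']
mapped = df1.assign(price2=df1['unique_id'].map(lookup))

# Cross-check against the left merge shown above.
merged = df1.merge(df2, on='unique_id', suffixes=('', '2'), how='left')
print(mapped)
```

Ids in df1 that have no match in df2 come back as NaN with this approach, the same as with the left merge.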
Experimental: FAST
Using numpy.searchsorted
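def pir1(d1, d2):
    u1 = d1.unique_id.values
    u2 = d2.unique_id.values
    p2 = d2.price.values
    # Sort d2's ids so they can be binary-searched.
    a = u2.argsort()
    # Inverse permutation of a; not used by the lookup below.
    u = np.empty_like(a)
    u[a] = np.arange(a.size)
    # Assumes every id in d1 also occurs in d2; a missing id would
    # silently pick up a neighbouring price or raise an IndexError.
    return d1.assign(price2=p2[a][u2[a].searchsorted(u1)])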
pir1(df1, df2)
   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
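The searchsorted lookup only gives correct results when every unique_id in the left frame also exists in the right one. The sketch below shows one possible way to guard against missing ids; the name pir1_safe and the choice to fill with NaN are assumptions for illustration, not part of the method timed here, and it assumes the right frame is non-empty.

```python
import numpy as np
import pandas as pd

def pir1_safe(d1, d2):
    # Hypothetical variant of pir1: ids missing from d2 get NaN instead of a wrong price.
    u1 = d1['unique_id'].values
    u2 = d2['unique_id'].values
    p2 = d2['price'].values
    a = u2.argsort()
    sorted_ids = u2[a]
    pos = sorted_ids.searchsorted(u1)
    # Clip so ids larger than every id in d2 do not index out of bounds.
    pos_clipped = np.minimum(pos, sorted_ids.size - 1)
    # A position is a real match only if the id found there equals the id searched for.
    found = sorted_ids[pos_clipped] == u1
    price2 = np.where(found, p2[a][pos_clipped], np.nan)
    return d1.assign(price2=price2)

left = pd.DataFrame({'unique_id': [1, 2, 7], 'price': [11, 12, 13]})
right = pd.DataFrame({'unique_id': [3, 1, 2], 'price': [11, 9, 10]})
result = pir1_safe(left, right)
print(result)
```

Because of the NaN fill, price2 comes back as a float column. The extra comparison and np.where add some work, so this variant should not be expected to match the pir1 timings.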
Timing
pir1 is the fastest method on both the small and the large data.
small data
%timeit pir1(df1, df2)
1000 loops, best of 3: 279 µs per loop
%timeit df1.assign(price2=df1['unique_id'].map(df2.set_index('unique_id')['price']))
1000 loops, best of 3: 892 µs per loop
%timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')
1000 loops, best of 3: 1.18 ms per loop
%timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
1000 loops, best of 3: 1.02 ms per loop
large data
Using @jezrael's test data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000, size=N)
df1 = pd.DataFrame({'unique_id': np.random.choice(L, N),
                    'price': np.random.choice(L, N)})
df2 = pd.DataFrame({'unique_id': np.arange(N),
                    'price': np.random.choice(L, N)})
%timeit pir1(df1, df2)
10 loops, best of 3: 104 ms per loop
%timeit df1.assign(price2=df1['unique_id'].map(df2.set_index('unique_id')['price']))
10 loops, best of 3: 138 ms per loop
%timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')
1 loop, best of 3: 243 ms per loop
%timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
10 loops, best of 3: 168 ms per loop
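%timeit is an IPython magic. To rerun the comparison from a plain Python script, a rough sketch with the standard timeit module could look like the following. The reduced N, the repeat counts and the helper name best_ms are assumptions, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: smaller than the original test so the script finishes quickly
L = np.random.randint(1000, size=N)
df1 = pd.DataFrame({'unique_id': np.random.choice(L, N),
                    'price': np.random.choice(L, N)})
df2 = pd.DataFrame({'unique_id': np.arange(N),
                    'price': np.random.choice(L, N)})

candidates = {
    'map': lambda: df1.assign(
        price2=df1['unique_id'].map(df2.set_index('unique_id')['price'])),
    'merge': lambda: df1.merge(df2, on='unique_id', suffixes=('', '2'), how='left'),
    'join': lambda: df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2'),
}

def best_ms(func, repeat=3, number=3):
    # Best-of-`repeat` average time per call, in milliseconds.
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3

results = {name: best_ms(func) for name, func in candidates.items()}
for name, ms in results.items():
    print(f'{name}: {ms:.1f} ms per call')
```

pir1 can be added to candidates in the same way once it is defined in the script.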