I have two dataframes, each with two columns: unique_id and price. The unique_ids in df1 are a subset of those in df2.

Now I need to add a third column to df1 that holds the price for the matching unique_id in df2, i.e. the columns will be: unique_id, price, price2.

How do I do this?

3 Answers

Using map is faster:

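import numpy as np
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 2, 3, 1, 2, 3],
                    'price': [4, 5, 6, 7, 8, 9]})

print(df1)

df2 = pd.DataFrame({'unique_id': [1, 2, 3],
                    'price': [46, 55, 44]})

print(df2)

# Build a unique_id -> price lookup from df2, then map df1's ids through it
df1['price2'] = df1['unique_id'].map(df2.set_index('unique_id')['price'])
print(df1)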
   price  unique_id  price2
0      4          1      46
1      5          2      55
2      6          3      44
3      7          1      46
4      8          2      55
5      9          3      44

Timings:

np.random.seed(123)
N = 1000000
L = np.random.randint(1000, size=N)
df1 = pd.DataFrame({'unique_id': np.random.choice(L, N),
                    'price': np.random.choice(L, N)})
print(df1)

df2 = pd.DataFrame({'unique_id': np.arange(N),
                    'price': np.random.choice(L, N)})

print(df2)

In [60]: %timeit df1['price2'] = df1['unique_id'].map(df2.set_index('unique_id')['price'])
1 loop, best of 3: 168 ms per loop

In [61]: %timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')
1 loop, best of 3: 373 ms per loop

In [62]: %timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
1 loop, best of 3: 252 ms per loop
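
As a quick sanity check, here is a minimal sketch that compares the map result with the left merge on the small sample frames. The names lookup, via_map, via_merge and same are only illustrative and are not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 2, 3, 1, 2, 3],
                    'price': [4, 5, 6, 7, 8, 9]})
df2 = pd.DataFrame({'unique_id': [1, 2, 3],
                    'price': [46, 55, 44]})

# unique_id -> price lookup built from df2 (illustrative name)
lookup = df2.set_index('unique_id')['price']

# Same column computed two ways
via_map = df1['unique_id'].map(lookup)
via_merge = df1.merge(df2, on='unique_id', suffixes=('', '2'), how='left')['price2']

# A left merge keeps df1's row order, so the values can be compared position by position
same = via_map.tolist() == via_merge.tolist()
print(same)
```

If the two ever disagree on your own data, duplicated unique_id values in df2 are a likely cause and worth checking first.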

Another solution:

df1['price_df2'] = df1['unique_id'].map(df2.set_index('unique_id')['price'])

Again borrowing @piRSquared's sample DataFrames ;-)

In [42]: df1
Out[42]:
   price  unique_id
0     11          1
1     12          2
2     13          3

In [43]: df2
Out[43]:
   price  unique_id
0      9          1
1     10          2
2     11          3
3     12          4
4     13          5

In [44]: df1['price_df2'] = df1['unique_id'].map(df2.set_index('unique_id')['price'])

In [45]: df1
Out[45]:
   price  unique_id  price_df2
0     11          1          9
1     12          2         10
2     13          3         11
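
The question says every unique_id in df1 exists in df2. If that assumption does not hold, map leaves NaN for the ids it cannot find, which also turns the new column into floats. A small sketch with a made-up id 99 that is absent from df2 (the data and the name n_missing are hypothetical):

```python
import pandas as pd

# Hypothetical data: unique_id 99 does not exist in df2
df1 = pd.DataFrame({'unique_id': [1, 2, 99], 'price': [11, 12, 13]})
df2 = pd.DataFrame({'unique_id': [1, 2, 3], 'price': [9, 10, 11]})

df1['price_df2'] = df1['unique_id'].map(df2.set_index('unique_id')['price'])

# Count the rows for which no price was found in df2
n_missing = int(df1['price_df2'].isna().sum())
print(df1)
print(n_missing)
```

Checking n_missing after the lookup is a cheap way to confirm that the subset assumption really holds.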

Consider the dataframes df1 and df2:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'unique_id': [1, 2, 3],
    'price': [11, 12, 13],
})

df2 = pd.DataFrame({
    'unique_id': [1, 2, 3, 4, 5],
    'price': [9, 10, 11, 12, 13],
})

merge

df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')

   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
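
One thing to watch with merge: if df2 ever contains a repeated unique_id, the left merge silently returns extra rows. The sketch below uses a hypothetical df2_dup with a duplicated id and the validate argument of merge to turn that case into an error. The names df2_dup, unchecked, n_rows and raised are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 2, 3], 'price': [11, 12, 13]})
# Hypothetical df2 in which unique_id 1 appears twice
df2_dup = pd.DataFrame({'unique_id': [1, 1, 2, 3], 'price': [9, 99, 10, 11]})

# Without validation, the duplicated id produces one extra row in the result
unchecked = df1.merge(df2_dup, on='unique_id', suffixes=('', '2'), how='left')
n_rows = len(unchecked)

# validate='many_to_one' asks pandas to check that the right-hand keys are unique
try:
    df1.merge(df2_dup, on='unique_id', suffixes=('', '2'), how='left',
              validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(n_rows, raised)
```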

join

df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')

   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11

Experimental: FAST
Using numpy.searchsorted

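def pir1(d1, d2):
    # Assumes every unique_id in d1 also appears in d2 and that d2's ids are unique
    u1 = d1.unique_id.values
    u2 = d2.unique_id.values
    p2 = d2.price.values
    a = u2.argsort()  # permutation that sorts d2's ids
    # Binary-search each d1 id in the sorted d2 ids, then pick the matching price
    return d1.assign(price2=p2[a][u2[a].searchsorted(u1)])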

pir1(df1, df2)

   price  unique_id  price2
0     11          1       9
1     12          2      10
2     13          3      11
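
Note that searchsorted returns an insertion position even for ids that are not in d2, so the function above only gives correct prices when every id in d1 exists in d2. Below is a sketch of a guarded variant, under the assumption that d2 is non-empty and its ids are unique. The name pir1_checked and the sample data are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def pir1_checked(d1, d2):
    """searchsorted lookup that yields NaN for ids missing from d2 (sketch)."""
    u1 = d1['unique_id'].to_numpy()
    u2 = d2['unique_id'].to_numpy()
    p2 = d2['price'].to_numpy()
    order = u2.argsort()                 # permutation that sorts d2's ids
    u2_sorted = u2[order]
    pos = u2_sorted.searchsorted(u1)     # insertion positions, may equal len(u2)
    pos = np.minimum(pos, u2_sorted.size - 1)  # keep positions inside the array
    found = u2_sorted[pos] == u1         # True only where the id really exists
    price2 = np.where(found, p2[order][pos], np.nan)
    return d1.assign(price2=price2)

# Hypothetical data: ids 0 and 99 are not present in d2, and d2 is unsorted
d1 = pd.DataFrame({'unique_id': [1, 2, 99, 0], 'price': [11, 12, 13, 14]})
d2 = pd.DataFrame({'unique_id': [3, 1, 2], 'price': [11, 9, 10]})
result = pir1_checked(d1, d2)
print(result)
```

The extra comparison costs one more pass over the data, so it will be somewhat slower than the unchecked version.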

Timing
pir1 is the fastest method
small data

%timeit pir1(df1, df2)
1000 loops, best of 3: 279 µs per loop

%timeit df1.assign(price2=df1['unique_id'].map(df2.set_index('unique_id')['price']))
1000 loops, best of 3: 892 µs per loop

%timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')
1000 loops, best of 3: 1.18 ms per loop

%timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
1000 loops, best of 3: 1.02 ms per loop

large data
Using @jezrael's test data

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N)
df1 = pd.DataFrame({'unique_id': np.random.choice(L, N),
                   'price':np.random.choice(L, N)})

df2 = pd.DataFrame({'unique_id': np.arange(N),
                   'price':np.random.choice(L, N)})


%timeit pir1(df1, df2)
10 loops, best of 3: 104 ms per loop

%timeit df1.assign(price2=df1['unique_id'].map(df2.set_index('unique_id')['price']))
10 loops, best of 3: 138 ms per loop

%timeit df1.merge(df2, on='unique_id', suffixes=['', '2'], how='left')
1 loop, best of 3: 243 ms per loop

%timeit df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2')
10 loops, best of 3: 168 ms per loop
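
The %timeit lines above only work inside IPython. For a plain script, a rough equivalent can be sketched with the standard timeit module. The smaller N, the candidates dict and the timings name are choices made for this sketch, and the numbers it prints will differ from the ones quoted above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
N = 10_000  # smaller than the 1,000,000 rows used above so the script runs quickly

df1 = pd.DataFrame({'unique_id': rng.integers(0, N, size=N),
                    'price': rng.integers(0, 1000, size=N)})
df2 = pd.DataFrame({'unique_id': np.arange(N),
                    'price': rng.integers(0, 1000, size=N)})

# The three pandas approaches from the answers, wrapped as callables
candidates = {
    'map': lambda: df1['unique_id'].map(df2.set_index('unique_id')['price']),
    'merge': lambda: df1.merge(df2, on='unique_id', suffixes=('', '2'), how='left'),
    'join': lambda: df1.join(df2.set_index('unique_id'), on='unique_id', rsuffix='2'),
}

# Best of 3 repeats, 5 calls each, reported as seconds per call
timings = {name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
           for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds * 1e3:.2f} ms per call')
```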
