Creating columns in a dataframe using data from another dataframe with conditions without using for loops

Question

I have two dataframes, df_1 and df_2

df_1 has 30k+ rows and looks like this

Col_1_1    Col_1_2    CA_CB
a          c          CA
a          c          CB
a          d          CA
b          c          CA
b          d          CB
b          d          CB
b          c          CA

I'd like to create two columns in df_1 using data from df_2 on the rows where CA_CB == "CB"; all other rows should get "X".

df_2 has 1k rows and looks like this (Col_2_1 has unique values)

Col_2_1    Col_2_2
a          data on a
b          data on b
c          data on c
d          data on d

My output should look like this:

Col_1_1    Col_1_2    CA_CB    Col_target_1    Col_target_2
a          c          CA       "X"             "X"
a          c          CB       data on a       data on c
a          d          CA       "X"             "X"
b          c          CA       "X"             "X"
b          d          CB       data on b       data on d
b          d          CB       data on b       data on d
b          c          CA       "X"             "X"

Currently I create Col_target_1 and Col_target_2 with a default value and then fill them in with a nested loop:

df_1["Col_target_1"] = "X"
df_1["Col_target_2"] = "X"

for i in range(len(df_1)):
    if df_1["CA_CB"][i] == "CB":
        # scan every row of df_2 for each "CB" row of df_1
        for j in range(len(df_2)):
            if df_1["Col_1_1"][i] == df_2["Col_2_1"][j]:
                # .loc avoids chained assignment
                df_1.loc[i, "Col_target_1"] = df_2["Col_2_2"][j]
            if df_1["Col_1_2"][i] == df_2["Col_2_1"][j]:
                df_1.loc[i, "Col_target_2"] = df_2["Col_2_2"][j]

This does what I want, but it takes 20+ minutes to run. Is there a faster way to do it?

Thank you in advance.

jpp · Accepted Answer · 2018-07-02 14:36:37Z

3

First create a series mapping from df_2:

s = df_2.set_index('Col_2_1')['Col_2_2']

Then map conditionally to df_1 using numpy.where:

import numpy as np

mask = df_1['CA_CB'] == 'CB'

df_1['Col_target_1'] = np.where(mask, df_1['Col_1_1'].map(s), 'X')
df_1['Col_target_2'] = np.where(mask, df_1['Col_1_2'].map(s), 'X')

mask is a Boolean series, which np.where uses to decide element-wise whether to take the second argument (where mask is True) or the third (where it is False).
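For anyone who wants to try this end to end, here is a minimal, self-contained sketch that rebuilds the two sample frames from the question and applies the same two steps. The frames are typed in by hand from the tables above; nothing else is assumed beyond pandas and numpy being installed.

```python
import numpy as np
import pandas as pd

# Sample data copied from the tables in the question.
df_1 = pd.DataFrame({
    "Col_1_1": ["a", "a", "a", "b", "b", "b", "b"],
    "Col_1_2": ["c", "c", "d", "c", "d", "d", "c"],
    "CA_CB":   ["CA", "CB", "CA", "CA", "CB", "CB", "CA"],
})
df_2 = pd.DataFrame({
    "Col_2_1": ["a", "b", "c", "d"],
    "Col_2_2": ["data on a", "data on b", "data on c", "data on d"],
})

# Lookup Series: the index holds the keys (Col_2_1), the values hold the data (Col_2_2).
s = df_2.set_index("Col_2_1")["Col_2_2"]

# True only on the rows that should receive data from df_2.
mask = df_1["CA_CB"] == "CB"

# Series.map looks up every key in one pass; np.where keeps the looked-up
# value where mask is True and falls back to "X" everywhere else.
df_1["Col_target_1"] = np.where(mask, df_1["Col_1_1"].map(s), "X")
df_1["Col_target_2"] = np.where(mask, df_1["Col_1_2"].map(s), "X")

print(df_1)
```

One thing to watch: if a "CB" row in df_1 has a key that does not appear in Col_2_1, map returns NaN for it, so that cell ends up as NaN rather than "X". Whether that matters depends on your data.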

edited Jul 2, 2018 at 14:36

answered Jul 2, 2018 at 14:31


1 Comment

JEA203 Over a year ago

Works perfectly! Thank you very much! It takes less than 0.5 seconds!
