
I have two DataFrames, good_df and bad_df:

    import numpy as np
    import pandas as pd

    article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
    price1 = [0.01, 7041.33, 0.01, 0.01, 6067.27]
    price2 = [0.01, 0.01, 9324.63, 0.01, 6673.99]
    price3 = [2980.31, 2869.4, 0.01, 1622.78, 0.01]
    bad_df = pd.DataFrame(list(zip(article, price1, price2, price3)),
                          columns=['article', 'price1', 'price2', 'price3'])

    article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
    price1 = [5, 7041.33, 9846, 4785.74, 6067.27]
    price2 = [np.nan, 562, 9324.63, 9841, 6673.99]
    price3 = [5, 2869.4, 6812, 1622.78, 3516]
    good_df = pd.DataFrame(list(zip(article, price1, price2, price3)),
                           columns=['article', 'price1', 'price2', 'price3'])

    bad_df:
        article   price1   price2   price3
    0  A9911652     0.01     0.01  2980.31
    1  A9911653  7041.33     0.01  2869.40
    2  A9911654     0.01  9324.63     0.01
    3  A9911659     0.01     0.01  1622.78
    4  A9911661  6067.27  6673.99     0.01

    good_df:
        article   price1   price2   price3
    0  A9911652     5.00      NaN     5.00
    1  A9911653  7041.33   562.00  2869.40
    2  A9911654  9846.00  9324.63  6812.00
    3  A9911659  4785.74  9841.00  1622.78
    4  A9911661  6067.27  6673.99  3516.00

I'd like to replace the 0.01 values in bad_df (columns 'price1', 'price2', 'price3') with the corresponding values from good_df, wherever those are non-NaN.

I thought of something like this:

    s=good_df.set_index('article')['price1','price2', 'price3']
    bad_df[s]=good_df['article'].map(s).good_df.s

Please help me with that.

  • Please share the expected output. (Commented Feb 4, 2020 at 7:21)

2 Answers


First replace the 0.01 values with missing values using DataFrame.mask, then use DataFrame.merge on article, fill the missing values from good_df, and finally restore any still-missing values to the original 0.01:

df = (bad_df.mask(bad_df == 0.01)
            .merge(good_df, on='article', suffixes=('','_'))
            .fillna(good_df)
            .fillna(0.01)[good_df.columns])
print(df)
    article   price1   price2   price3
0  A9911652     5.00     0.01  2980.31
1  A9911653  7041.33   562.00  2869.40
2  A9911654  9846.00  9324.63  6812.00
3  A9911659  4785.74  9841.00  1622.78
4  A9911661  6067.27  6673.99  3516.00
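To make the masking step concrete, here is a minimal sketch (reusing the question's data) of the intermediate frame that `DataFrame.mask` produces before the merge:

```python
import pandas as pd

article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
bad_df = pd.DataFrame({'article': article,
                       'price1': [0.01, 7041.33, 0.01, 0.01, 6067.27],
                       'price2': [0.01, 0.01, 9324.63, 0.01, 6673.99],
                       'price3': [2980.31, 2869.4, 0.01, 1622.78, 0.01]})

# mask() replaces every cell where the condition holds with NaN;
# the string 'article' column never compares equal to 0.01, so it survives
masked = bad_df.mask(bad_df == 0.01)
print(masked)
```

The NaN cells are exactly the slots the subsequent `fillna` calls fill, first from good_df and then back to 0.01.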

The solution above only works if both DataFrames contain the same article values in the same order. For a general solution, it is necessary to replace values column by column in the merged DataFrame:

df = bad_df.mask(bad_df == 0.01).merge(good_df, on='article', suffixes=('','_'), how='left')
cols = good_df.columns.difference(['article'], sort=False)
df[cols] = df[cols].fillna(df[cols + '_'].fillna(0.01).rename(columns=lambda x: x.strip('_')))
df = df[good_df.columns]
print(df)
    article   price1   price2   price3
0  A9911652     5.00     0.01  2980.31
1  A9911653  7041.33   562.00  2869.40
2  A9911654  9846.00  9324.63  6812.00
3  A9911659  4785.74  9841.00  1622.78
4  A9911661  6067.27  6673.99  3516.00
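As an alternative sketch (not part of the answer above, and assuming article uniquely identifies each row), both frames can be indexed by article so that pandas aligns the replacement values by label, regardless of row or column order:

```python
import numpy as np
import pandas as pd

article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
bad_df = pd.DataFrame({'article': article,
                       'price1': [0.01, 7041.33, 0.01, 0.01, 6067.27],
                       'price2': [0.01, 0.01, 9324.63, 0.01, 6673.99],
                       'price3': [2980.31, 2869.4, 0.01, 1622.78, 0.01]})
good_df = pd.DataFrame({'article': article,
                        'price1': [5, 7041.33, 9846, 4785.74, 6067.27],
                        'price2': [np.nan, 562, 9324.63, 9841, 6673.99],
                        'price3': [5, 2869.4, 6812, 1622.78, 3516]})

b = bad_df.set_index('article')
g = good_df.set_index('article')

# Bring g onto b's exact row/column labels (NaN where g has no match),
# then: where b holds 0.01 take g's value, and fall back to the
# original 0.01 wherever g had nothing to offer.
g_aligned = g.reindex(index=b.index, columns=b.columns)
out = b.mask(b == 0.01, g_aligned).fillna(b).reset_index()
print(out)
```

Because everything is matched by label, this also leaves untouched any bad_df row whose article is absent from good_df.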

3 Comments

@SergeyBelousov - Super, I also added a more general solution.
The general solution you provided is for the case when the DataFrames have a different column order, i.e. 'price3', 'price1', 'price2' instead of 'price1', 'price2', 'price3'?
@SergeyBelousov - yes, or a different order of article values.

The difficulty of this task results from the fact that the condition to check involves corresponding cells of both DataFrames. This is why an "ordinary" DataFrame.where is not an option.

So I decided to join both DataFrames (on article) and then apply a function to each row, generating the target row.

To do your task, define the following function:

def upd(row):
    '''
    Generate an updated row for "bad_df"
    row -  a joined row for "bad_df" and "good_df"
    '''
    siz = row.size
    siz2 = siz // 2  # Size of the left half (from bad_df)
    # Operate on Numpy arrays to get rid of column names
    v1 = row.values[0:siz2]  # Left half (from bad_df)
    v2 = row.values[siz2:]   # Right half (from good_df)
    msk = np.equal(v1, 0.01) & ~np.isnan(v2)
    return pd.Series(np.where(msk, v2, v1), index=row.index[0:siz2])

Then apply it:

bad_df.set_index('article').join(good_df.set_index('article'),
    rsuffix='_g').apply(upd, axis=1).reset_index()

Note:

My solution also works correctly when bad_df contains "additional" rows whose article is not present in good_df.

To demonstrate this feature, I added one row to bad_df, so that it contains:

    article   price1   price2   price3
0  A9911652     0.01     0.01  2980.31
1  A9911653  7041.33     0.01  2869.40
2  A9911654     0.01  9324.63     0.01
3  A9911659     0.01     0.01  1622.78
4  A9911661  6067.27  6673.99     0.01
5      AXXX     0.01     0.01     0.01

Then my code gives:

    article   price1   price2   price3
0  A9911652     5.00     0.01  2980.31
1  A9911653  7041.33   562.00  2869.40
2  A9911654  9846.00  9324.63  6812.00
3  A9911659  4785.74  9841.00  1622.78
4  A9911661  6067.27  6673.99  3516.00
5      AXXX     0.01     0.01     0.01

leaving this additional row untouched (no corresponding data in good_df) while the other solution deletes this row.
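The row-wise apply above can also be expressed, under the same left join, as one vectorized mask; this is a sketch equivalent to upd() (including the extra AXXX row):

```python
import numpy as np
import pandas as pd

article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661', 'AXXX']
bad_df = pd.DataFrame({'article': article,
                       'price1': [0.01, 7041.33, 0.01, 0.01, 6067.27, 0.01],
                       'price2': [0.01, 0.01, 9324.63, 0.01, 6673.99, 0.01],
                       'price3': [2980.31, 2869.4, 0.01, 1622.78, 0.01, 0.01]})
good_df = pd.DataFrame({'article': article[:5],
                        'price1': [5, 7041.33, 9846, 4785.74, 6067.27],
                        'price2': [np.nan, 562, 9324.63, 9841, 6673.99],
                        'price3': [5, 2869.4, 6812, 1622.78, 3516]})

# Same left join as in the answer; rows missing from good_df get NaN suffixes
joined = bad_df.set_index('article').join(good_df.set_index('article'),
                                          rsuffix='_g')
left = joined[['price1', 'price2', 'price3']]
# strip the '_g' suffix so both halves align on identical column labels
right = joined[['price1_g', 'price2_g', 'price3_g']].rename(
    columns=lambda c: c[:-2])

# same condition as upd(): the cell is 0.01 in bad_df AND good_df has a value
result = left.mask((left == 0.01) & right.notna(), right).reset_index()
print(result)
```

This avoids the per-row Python overhead of apply while keeping the identical condition, and it likewise leaves the AXXX row untouched.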

5 Comments

@Vladi_Bo I ran the code you provided. No errors, but bad_df is still the same.
My code only generates the proper result, without saving it anywhere. If you want to overwrite bad_df with this result, prepend `bad_df = ` to the above code.
@SergeyBelousov - I think this solution fails if the columns are not in the same order in both DataFrames; it also converts all columns to strings if numeric and non-numeric columns are mixed, so better not to use it if you need a general solution.
@SergeyBelousov - But if all columns are numeric and in the same order, it works.
@Valdi_Bo - Please update your answer; it is presented as a general solution, but it is not.
