
I have two DataFrames, good_df and bad_df:

    import numpy as np
    import pandas as pd

    article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
    price1 = [0.01, 7041.33, 0.01, 0.01, 6067.27]
    price2 = [0.01, 0.01, 9324.63, 0.01, 6673.99]
    price3 = [2980.31, 2869.4, 0.01, 1622.78, 0.01]
    bad_df = pd.DataFrame(list(zip(article, price1, price2, price3)),
                          columns=['article', 'price1', 'price2', 'price3'])

    article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
    price1 = [5, 7041.33, 9846, 4785.74, 6067.27]
    price2 = [np.nan, 562, 9324.63, 9841, 6673.99]
    price3 = [5, 2869.4, 6812, 1622.78, 3516]
    good_df = pd.DataFrame(list(zip(article, price1, price2, price3)),
                           columns=['article', 'price1', 'price2', 'price3'])

    bad_df:
        article   price1   price2   price3
    0  A9911652     0.01     0.01  2980.31
    1  A9911653  7041.33     0.01  2869.40
    2  A9911654     0.01  9324.63     0.01
    3  A9911659     0.01     0.01  1622.78
    4  A9911661  6067.27  6673.99     0.01

    good_df:
        article   price1   price2   price3
    0  A9911652     5.00      NaN     5.00
    1  A9911653  7041.33   562.00  2869.40
    2  A9911654  9846.00  9324.63  6812.00
    3  A9911659  4785.74  9841.00  1622.78
    4  A9911661  6067.27  6673.99  3516.00

I'd like to replace the 0.01 values in bad_df (columns 'price1', 'price2', 'price3') with the corresponding values from good_df, wherever those are non-NaN.

I thought of something like this:

    s=good_df.set_index('article')['price1','price2', 'price3']
    bad_df[s]=good_df['article'].map(s).good_df.s

Please help me with that.

  • Please share the expected output. (Commented Feb 4, 2020 at 7:21)

2 Answers


First replace the 0.01 values with missing values using DataFrame.mask, then use DataFrame.merge on article, fill the missing values from good_df, and finally restore any still-missing values to the original 0.01:

df = (bad_df.mask(bad_df == 0.01)
            .merge(good_df, on='article', suffixes=('','_'))
            .fillna(good_df)
            .fillna(0.01)[good_df.columns])
print(df)
    article   price1   price2   price3
0  A9911652     5.00     0.01  2980.31
1  A9911653  7041.33   562.00  2869.40
2  A9911654  9846.00  9324.63  6812.00
3  A9911659  4785.74  9841.00  1622.78
4  A9911661  6067.27  6673.99  3516.00
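To make the masking step concrete, here is a minimal sketch (reusing the question's data) of the intermediate frame that `DataFrame.mask` produces before the merge:

```python
import pandas as pd

article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
bad_df = pd.DataFrame({'article': article,
                       'price1': [0.01, 7041.33, 0.01, 0.01, 6067.27],
                       'price2': [0.01, 0.01, 9324.63, 0.01, 6673.99],
                       'price3': [2980.31, 2869.4, 0.01, 1622.78, 0.01]})

# mask() replaces every cell where the condition holds with NaN;
# the string 'article' column never compares equal to 0.01, so it survives
masked = bad_df.mask(bad_df == 0.01)
print(masked)
```

The NaN cells are exactly the slots the subsequent `fillna` calls fill, first from good_df and then back to 0.01.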

The solution above only works if both DataFrames contain the same article values in the same order. For a general solution, it is necessary to replace values column by column in the merged DataFrame:

df = bad_df.mask(bad_df == 0.01).merge(good_df, on='article', suffixes=('','_'), how='left')
cols = good_df.columns.difference(['article'], sort=False)
df[cols] = df[cols].fillna(df[cols + '_'].fillna(0.01).rename(columns=lambda x: x.strip('_')))
df = df[good_df.columns]
print(df)
    article   price1   price2   price3
0  A9911652     5.00     0.01  2980.31
1  A9911653  7041.33   562.00  2869.40
2  A9911654  9846.00  9324.63  6812.00
3  A9911659  4785.74  9841.00  1622.78
4  A9911661  6067.27  6673.99  3516.00
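As an alternative sketch (not part of the answer above, and assuming article uniquely identifies each row), both frames can be indexed by article so that pandas aligns the replacement values by label, regardless of row or column order:

```python
import numpy as np
import pandas as pd

article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661']
bad_df = pd.DataFrame({'article': article,
                       'price1': [0.01, 7041.33, 0.01, 0.01, 6067.27],
                       'price2': [0.01, 0.01, 9324.63, 0.01, 6673.99],
                       'price3': [2980.31, 2869.4, 0.01, 1622.78, 0.01]})
good_df = pd.DataFrame({'article': article,
                        'price1': [5, 7041.33, 9846, 4785.74, 6067.27],
                        'price2': [np.nan, 562, 9324.63, 9841, 6673.99],
                        'price3': [5, 2869.4, 6812, 1622.78, 3516]})

b = bad_df.set_index('article')
g = good_df.set_index('article')

# Bring g onto b's exact row/column labels (NaN where g has no match),
# then: where b holds 0.01 take g's value, and fall back to the
# original 0.01 wherever g had nothing to offer.
g_aligned = g.reindex(index=b.index, columns=b.columns)
out = b.mask(b == 0.01, g_aligned).fillna(b).reset_index()
print(out)
```

Because everything is matched by label, this also leaves untouched any bad_df row whose article is absent from good_df.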

3 Comments

@SergeyBelousov - Super, I also added a more general solution.
The general solution you provided is for the case when the DataFrames have a different column order, i.e. 'price3', 'price1', 'price2' instead of 'price1', 'price2', 'price3'?
@SergeyBelousov - yes, or a different order of article values.

The difficulty of this task results from the fact that the condition to check involves corresponding cells of both DataFrames. This is why an "ordinary" DataFrame.where is not an option.

So I decided to join both DataFrames (on article) and then apply a function to each row, generating the target row.

To do your task, define the following function:

def upd(row):
    '''
    Generate an updated row for "bad_df"
    row -  a joined row for "bad_df" and "good_df"
    '''
    siz = row.size
    siz2 = siz // 2  # Size of the left half (from bad_df)
    # Operate on Numpy arrays to get rid of column names
    v1 = row.values[0:siz2]  # Left half (from bad_df)
    v2 = row.values[siz2:]   # Right half (from good_df)
    msk = np.equal(v1, 0.01) & ~np.isnan(v2)
    return pd.Series(np.where(msk, v2, v1), index=row.index[0:siz2])

Then apply it:

bad_df.set_index('article').join(good_df.set_index('article'),
    rsuffix='_g').apply(upd, axis=1).reset_index()

Note:

My solution also works correctly when bad_df contains "additional" rows whose article is not present in good_df.

To demonstrate this feature, I added one row to bad_df, so that it contains:

    article   price1   price2   price3
0  A9911652     0.01     0.01  2980.31
1  A9911653  7041.33     0.01  2869.40
2  A9911654     0.01  9324.63     0.01
3  A9911659     0.01     0.01  1622.78
4  A9911661  6067.27  6673.99     0.01
5      AXXX     0.01     0.01     0.01

Then my code gives:

    article   price1   price2   price3
0  A9911652     5.00     0.01  2980.31
1  A9911653  7041.33   562.00  2869.40
2  A9911654  9846.00  9324.63  6812.00
3  A9911659  4785.74  9841.00  1622.78
4  A9911661  6067.27  6673.99  3516.00
5      AXXX     0.01     0.01     0.01

leaving this additional row untouched (no corresponding data in good_df) while the other solution deletes this row.
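The row-wise apply above can also be expressed, under the same left join, as one vectorized mask; this is a sketch equivalent to upd() (including the extra AXXX row):

```python
import numpy as np
import pandas as pd

article = ['A9911652', 'A9911653', 'A9911654', 'A9911659', 'A9911661', 'AXXX']
bad_df = pd.DataFrame({'article': article,
                       'price1': [0.01, 7041.33, 0.01, 0.01, 6067.27, 0.01],
                       'price2': [0.01, 0.01, 9324.63, 0.01, 6673.99, 0.01],
                       'price3': [2980.31, 2869.4, 0.01, 1622.78, 0.01, 0.01]})
good_df = pd.DataFrame({'article': article[:5],
                        'price1': [5, 7041.33, 9846, 4785.74, 6067.27],
                        'price2': [np.nan, 562, 9324.63, 9841, 6673.99],
                        'price3': [5, 2869.4, 6812, 1622.78, 3516]})

# Same left join as in the answer; rows missing from good_df get NaN suffixes
joined = bad_df.set_index('article').join(good_df.set_index('article'),
                                          rsuffix='_g')
left = joined[['price1', 'price2', 'price3']]
# strip the '_g' suffix so both halves align on identical column labels
right = joined[['price1_g', 'price2_g', 'price3_g']].rename(
    columns=lambda c: c[:-2])

# same condition as upd(): the cell is 0.01 in bad_df AND good_df has a value
result = left.mask((left == 0.01) & right.notna(), right).reset_index()
print(result)
```

This avoids the per-row Python overhead of apply while keeping the identical condition, and it likewise leaves the AXXX row untouched.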

5 Comments

@Vladi_Bo I ran the code you provided. No errors, but bad_df is still the same.
My code only generates the proper result, without saving it anywhere. If you want to overwrite bad_df with this result, prepend `bad_df = ` to the above code.
@SergeyBelousov - I think this solution fails if the columns are not in the same order in both DataFrames; it also converts all columns to strings if numeric and non-numeric columns are mixed, so better not to use it if you need a general solution.
@SergeyBelousov - But if all columns are numeric and in the same order, it works.
@Valdi_Bo - Please update your answer; it is presented as a general solution, but it is not.
