Pandas Dataframe comparison

Question

I have two very large dataframes (20k+ rows each): input_df and output_df.

input_df is made of test cases; output_df is filled with the results from those test cases.

I need to select all the case numbers which failed from output_df and then fix those cases in the input_df dataframe. The fix is selecting a new unique date for each case_id.

The new unique date has to be 7*k days away from the prior date (a multiple of 7 days), before or after. So I need to use datetimes.

Basically, I want to do this: select the failing case numbers from the output results

=> output_df[output_df['ouput_result'] == 'FAIL']
  => get the results in some array or vector (how?)

then go to the input sheet and do

=> input_df.groupby('input_carId')
=> replace the failing dates with a new unique date within +-7*k days of the old date

But the new date has to be unique for that input_carId, so I think I need to use unique().

I cannot use output_df as input_df; they are two very different sheets. I greatly simplified their schema here, and they only share 3 columns. Also, they actually contain 20,000+ such rows and ids.

In the end I want the old input_df, but with the new dates.

output_df

case_id        output_date        output_carId   ouput_result
1                 01/20/21             001          FAIL
2                 02/21/21             001          SUCCESS  
3                 02/08/20             003          FAIL 
4                 01/07/20             001          FAIL
5                 09/05/20             002          SUCCESS

input_df (old)

case_id    input_date         input_carId  
    1          01/20/21             001  
    2          02/21/21             002 
    3          02/08/20             003
    4          01/07/20             001
    5          09/05/20             002

expected result =>

input_df (new)

   case_id   input_date         input_carId  
    1          01/13/21             001  
    2          02/21/21             002 
    3          02/22/20             003
    4          01/28/20             001
    5          09/05/20             002

Notice that the dates for the failed cases (rows 1, 3 and 4) have changed by plus or minus a multiple of 7 days.
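As a quick sanity check of the sample data, the shifts between the old and new dates of the three failed cases can be computed directly. This is only a sketch: it assumes the dates are written as MM/DD/YY, and the variable names `old`, `new` and `shift_days` are made up for the illustration.

```python
import pandas as pd

# Old and new dates of the failed cases 1, 3 and 4, copied from the tables above.
# Assumption: the dates are in MM/DD/YY format.
old = pd.to_datetime(["01/20/21", "02/08/20", "01/07/20"], format="%m/%d/%y")
new = pd.to_datetime(["01/13/21", "02/22/20", "01/28/20"], format="%m/%d/%y")

# Difference in whole days for each case.
shift_days = (new - old).days.tolist()
print(shift_days)  # [-7, 14, 21]

# Every shift is a multiple of 7 days.
print(all(d % 7 == 0 for d in shift_days))  # True
```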

What does the expected output look like? Do the dates in the sample data match the +-7 days rule?
– jezrael, Commented Feb 22, 2021 at 12:21
Can you please clarify: do you wish to generate a new dataframe with a unique date? Also, what makes a "unique row" in output_df or input_df? Is it car_Id + Date? Otherwise, the problem is not well defined.
– supercooler8, Commented Feb 22, 2021 at 12:22
@supercooler8 I want to edit the input dataframe with those new dates. For each carId I need a set of dates, and they all need to be different. So, for instance, if I do a groupBy on carId 001, the rows for this carId should all have different dates. I only need to change the dates which have failed.
– uniXVanXcel, Commented Feb 22, 2021 at 12:25
@jezrael The expected output is a new input sheet with the dates fixed for the failed cases.
– uniXVanXcel, Commented Feb 22, 2021 at 12:26
I mean: what does the data in the new DataFrame look like for the sample data in the question?
– jezrael, Commented Feb 22, 2021 at 12:26

jezrael · Accepted Answer · 2021-02-23 06:02:43Z


Use a custom function to add or subtract a multiple of 7 days for the rows with FAIL:

import numpy as np
import pandas as pd

output_df['output_date'] = pd.to_datetime(output_df['output_date'])
input_df['input_date'] = pd.to_datetime(input_df['input_date'])

cases = output_df.loc[output_df['ouput_result'] =='FAIL', 'case_id']
print (cases)
0    1
2    3
3    4
Name: case_id, dtype: int64

def func(dates):

    # number of failed rows in this input_carId group
    count = len(dates)

    # one distinct multiple of 7 per failed row: 7, 14, ... (0 is omitted)
    arr = np.arange(1, count + 1) * 7
    # shuffle so the offsets are assigned to the rows at random
    np.random.shuffle(arr)
    # convert the day offsets to timedeltas for adding or subtracting
    td = pd.to_timedelta(arr, unit='d')
    less = dates - td
    more = dates + td
    # randomly choose, per row, whether to subtract or add
    rand = np.random.randint(2, size=count, dtype=bool)

    # return each date shifted by + or - a multiple of 7 days
    return np.where(rand, less, more)

# keep only the rows whose case_id failed
mask = input_df['case_id'].isin(cases)
input_df.loc[mask, 'input_date'] = (input_df[mask].groupby('input_carId')['input_date']
                                                  .transform(func))

print (input_df)
   case_id input_date  input_carId
0        1 2021-02-03            1
1        2 2021-02-21            2
2        3 2020-02-15            3
3        4 2020-01-14            1
4        5 2020-09-05            2
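
The function above picks its offsets at random and does not look at the dates already present for the same input_carId, so a new date could in principle collide with an existing one. The following is a minimal, deterministic sketch of one way to enforce uniqueness across old and new dates per car. It is not part of the original answer: the helper name `shift_failed_dates` and its `step_days` parameter are hypothetical, and it assumes that trying `old - 7k` before `old + 7k` for growing `k` is acceptable.

```python
import pandas as pd


def shift_failed_dates(input_df, failed_cases, step_days=7):
    """Hypothetical helper: return a copy of input_df in which every failed
    case gets a new date that is a multiple of step_days away from its old
    date and not yet used (old or new) for the same input_carId."""
    out = input_df.copy()
    for _, group in input_df.groupby("input_carId"):
        # All dates currently known for this car; new dates must avoid them.
        used = set(group["input_date"])
        failed_idx = group.loc[group["case_id"].isin(failed_cases)].index
        for idx in failed_idx:
            old = input_df.at[idx, "input_date"]
            k = 1
            while True:
                delta = pd.Timedelta(days=step_days * k)
                # Assumption: prefer the earlier date, then the later one.
                free = [c for c in (old - delta, old + delta) if c not in used]
                if free:
                    new = free[0]
                    break
                k += 1
            used.add(new)
            out.at[idx, "input_date"] = new
    return out


# Sample data from the question (assumed MM/DD/YY, carId kept as strings).
input_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": pd.to_datetime(
        ["01/20/21", "02/21/21", "02/08/20", "01/07/20", "09/05/20"],
        format="%m/%d/%y",
    ),
    "input_carId": ["001", "002", "003", "001", "002"],
})

fixed = shift_failed_dates(input_df, failed_cases=[1, 3, 4])
print(fixed)
```

The loop is row by row, so on 20k+ rows it will be slower than the vectorized `transform` approach; it trades speed for an explicit uniqueness check.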

edited Feb 23, 2021 at 6:02

answered Feb 22, 2021 at 12:30

jezrael


23 Comments

uniXVanXcel Over a year ago

First, thank you. But I do not want to merge input_df and output_df. I want to keep the old, separate input_df, but with the new dates.

jezrael Over a year ago

@uniXVanXcel - Hmmm, I can only guess what you need, because I cannot verify my solution :(

uniXVanXcel Over a year ago

The final result that I need is a new input_df with the new dates, so that I can run the tests again with the new input_df. Does that make sense?

jezrael Over a year ago

@uniXVanXcel Do the datetimes have to be unique only among the newly added values per group? Or unique across the new and the old, unchanged datetimes?

uniXVanXcel Over a year ago

Yes, unique across new + old for each groupBy(case_id)
