
I have two very large dataframes (20k+ rows each): input_df and output_df.

input_df is made up of test cases; output_df holds the results of those test cases.

I need to select all the case numbers that failed from output_df and then fix those cases in input_df. The fix is to pick a new unique date for each failed case_id.

The new date has to be exactly 7*k days (a whole number of weeks) before or after the prior date, so I need to use datetime arithmetic.
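For illustration, the 7*k-day rule can be sketched with pandas as follows. This is a minimal sketch: it assumes k is a nonzero integer, and the helper name `shift_by_weeks` is hypothetical.

```python
import pandas as pd

def shift_by_weeks(date, k):
    """Return `date` moved by 7*k days; a negative k moves it earlier."""
    return pd.Timestamp(date) + pd.Timedelta(days=7 * k)

old = pd.Timestamp("2021-01-20")
print(shift_by_weeks(old, -1))  # one week earlier
print(shift_by_weeks(old, 2))   # two weeks later
```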

Basically, I want to do this: select the failed case numbers from the output results

=> output_df[output_df['ouput_result'] == 'FAIL']
  => get the results in some array or vector  **(how?)**

then go to input_df and do

=> input_df.groupby('input_carId')
=> replace the failing dates with a new unique date within ±7k days of the old date

The new date has to be unique for that input_carId, so I think I need to use unique().

I cannot use output_df as input_df; they are two very different sheets. I greatly simplified their schema here; they share only 3 columns. There are actually 20,000+ such rows and ids.

In the end I want the old input_df, but with the new dates.

output_df

case_id   output_date   output_carId   ouput_result
1         01/20/21      001            FAIL
2         02/21/21      001            SUCCESS
3         02/08/20      003            FAIL
4         01/07/20      001            FAIL
5         09/05/20      002            SUCCESS

input_df (old)

case_id   input_date    input_carId
1         01/20/21      001
2         02/21/21      002
3         02/08/20      003
4         01/07/20      001
5         09/05/20      002

expected result =>

input_df (new)

case_id   input_date    input_carId
1         01/13/21      001
2         02/21/21      002
3         02/22/20      003
4         01/28/20      001
5         09/05/20      002

Notice that the dates for the failed cases (rows 1, 3, 4) have changed by plus or minus a multiple of 7 days.
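As a sanity check on the sample tables above, the following sketch rebuilds the old and new input_df by hand and verifies both rules: every change is a multiple of 7 days, and no input_carId ends up with a repeated date. The variable names are chosen only for this illustration.

```python
import pandas as pd

old = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": ["01/20/21", "02/21/21", "02/08/20", "01/07/20", "09/05/20"],
    "input_carId": ["001", "002", "003", "001", "002"],
})
# Same frame, with the expected dates for the failed cases 1, 3 and 4.
new = old.assign(
    input_date=["01/13/21", "02/21/21", "02/22/20", "01/28/20", "09/05/20"]
)

for df in (old, new):
    df["input_date"] = pd.to_datetime(df["input_date"], format="%m/%d/%y")

# Day difference per case; unchanged rows give 0.
delta_days = (new["input_date"] - old["input_date"]).dt.days
changed = old.loc[delta_days != 0, "case_id"].tolist()

all_multiples_of_7 = bool((delta_days % 7 == 0).all())
unique_per_car = not new.duplicated(["input_carId", "input_date"]).any()
print(changed, all_multiples_of_7, unique_per_car)  # prints: [1, 3, 4] True True
```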

  • What does the expected output look like? Do the dates in the sample data match the ±7 days rule? Commented Feb 22, 2021 at 12:21
  • Can you please clarify: do you wish to generate a new dataframe with unique dates? Also, what makes a row unique in output_df or input_df? Is it car_Id + date? Otherwise, the problem is not well defined. Commented Feb 22, 2021 at 12:22
  • @supercooler8 I want to edit the input dataframe with those new dates. For each carId I need a set of dates, and they all need to be different. So, for instance, for carId 001, if I do a groupBy, the rows for this carId should all have different dates. I only need to change the dates which have failed. Commented Feb 22, 2021 at 12:25
  • @jezrael The expected output is a new input sheet with the dates fixed for the failed cases. Commented Feb 22, 2021 at 12:26
  • I mean: what does the data in the new DataFrame look like, based on the sample data in the question? Commented Feb 22, 2021 at 12:26

1 Answer


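Use a custom function to add or subtract a multiple of 7 days for the rows with FAIL:

import numpy as np
import pandas as pd

# the sample dates are month/day/two-digit year
output_df['output_date'] = pd.to_datetime(output_df['output_date'], format='%m/%d/%y')
input_df['input_date'] = pd.to_datetime(input_df['input_date'], format='%m/%d/%y')

# case ids of the failed rows
cases = output_df.loc[output_df['ouput_result'] == 'FAIL', 'case_id']
print(cases)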
0    1
2    3
3    4
Name: case_id, dtype: int64

def func(dates):
    # number of failed rows in this group
    count = len(dates)

    # offsets of 7, 14, ..., 7 * count days (0 is omitted so every date changes)
    arr = np.arange(1, count + 1) * 7
    # shuffle so the offsets are assigned to rows in random order
    np.random.shuffle(arr)
    # convert the offsets to timedeltas to add or subtract
    td = pd.to_timedelta(arr, unit='d')
    less = dates - td
    more = dates + td
    # randomly choose, per row, whether to subtract or add
    rand = np.random.randint(2, size=count, dtype=bool)

    # return each date shifted by plus or minus a multiple of 7 days
    return np.where(rand, less, more)

# select only the rows of the failed cases
mask = input_df['case_id'].isin(cases)
input_df.loc[mask, 'input_date'] = (input_df[mask].groupby('input_carId')['input_date']
                                                  .transform(func))

print(input_df)
   case_id input_date  input_carId
0        1 2021-02-03            1
1        2 2021-02-21            2
2        3 2020-02-15            3
3        4 2020-01-14            1
4        5 2020-09-05            2
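
The function above picks its offsets at random and does not compare the new dates with dates already present for the same car, so a collision is possible in principle. The comments under this answer indicate that the dates must be unique across both the changed and the unchanged rows of each input_carId. Under that assumption, a deterministic variant could look like the sketch below. The helper name `fix_failed_dates` and the search bound `max_k` are hypothetical, and the loop is written for clarity rather than speed.

```python
import pandas as pd

def fix_failed_dates(input_df, failed_ids, max_k=520):
    """Return a copy of input_df in which each failed case gets a date that is
    7*k days from its old date and not yet used by the same input_carId."""
    out = input_df.copy()
    failed = out["case_id"].isin(failed_ids)
    for _, group in out.groupby("input_carId"):
        is_failed = failed.loc[group.index]
        # Dates of the non-failed rows of this car must stay untouched.
        used = set(group.loc[~is_failed, "input_date"])
        for idx in group.index[is_failed.to_numpy()]:
            old = out.at[idx, "input_date"]
            # Try +7, -7, +14, -14, ... days until a free date is found.
            # next() raises StopIteration if nothing is free within max_k weeks.
            candidates = (old + pd.Timedelta(days=7 * k * sign)
                          for k in range(1, max_k + 1) for sign in (1, -1))
            new = next(d for d in candidates if d not in used)
            out.at[idx, "input_date"] = new
            used.add(new)
    return out

input_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": pd.to_datetime(
        ["01/20/21", "02/21/21", "02/08/20", "01/07/20", "09/05/20"],
        format="%m/%d/%y"),
    "input_carId": [1, 2, 3, 1, 2],
})
fixed = fix_failed_dates(input_df, failed_ids=[1, 3, 4])
print(fixed)
```

Because the search always tries the nearest offsets first, the result is reproducible. If random offsets are preferred, the candidate order could be shuffled instead while keeping the same `used` check.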

23 Comments

First, thank you. But I do not want to merge input_df and output_df. I want to keep the old, separate input_df, but with the new dates.
@uniXVanXcel - Hmm, I can only guess what you need, because I cannot verify my solution :(
The final result I need is a new input_df with the new dates, so that I can run the tests again with it. Does that make sense?
@uniXVanXcel Do the datetimes have to be unique only among the newly added values per group, or among the new values plus the old, unchanged datetimes?
Yes, unique across new + old for each groupBy(case_id)