I have two very large dataframes (20k+ rows each): input_df and output_df.
input_df is made of test cases; output_df is filled with the results from those test cases.
I need to select all the case numbers which failed from output_df and then fix those cases in the input_df dataframe. The fix is selecting a new unique date for each case_id.
The new date has to be 7*k days away from the prior date (a multiple of 7 days, before or after). So I think I need to work with datetimes.
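To make the date rule concrete, here is a minimal sketch of what I mean by "7*k days before or after". It assumes the dates are MM/DD/YY strings as in the sample tables below, and the range of k (-3 to 3) is only an illustration, not a real limit.

```python
import pandas as pd

# Parse one of the MM/DD/YY strings from the sheets into a real timestamp.
old_date = pd.to_datetime("01/20/21", format="%m/%d/%y")

# Candidate replacement dates: the old date shifted by k whole weeks.
# k = 0 is skipped because the date has to change; the -3..3 range is arbitrary.
candidates = [old_date + pd.Timedelta(days=7 * k) for k in range(-3, 4) if k != 0]

for c in candidates:
    print(c.strftime("%m/%d/%y"))
```

Any of these candidates would be acceptable for that case, as long as it is not already used by the same input_carId.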
Basically, I want to do this:
select the failed case numbers from the output results
=> output_df[output_df['output_result'] == 'FAIL']
=> get those case_ids in some array or vector **(how?)**
then go to input_df and
=> input_df.groupby('input_carId')
=> replace each failing date with a new date 7*k days before or after the old date
The new date has to be unique for that input_carId, though, so I think I need to use unique().
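For the first two steps, this is roughly what I have in mind: a boolean mask to pick the failing rows, then `isin` to find the matching rows in input_df. It is only a sketch on the sample data from this question; I am assuming the result column is spelled `output_result` and that `case_id` identifies the same case in both sheets.

```python
import pandas as pd

# Sample data copied from the tables in this question (3 shared columns only).
output_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "output_carId": ["001", "001", "003", "001", "002"],
    "output_result": ["FAIL", "SUCCESS", "FAIL", "FAIL", "SUCCESS"],
})
input_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": ["01/20/21", "02/21/21", "02/08/20", "01/07/20", "09/05/20"],
    "input_carId": ["001", "002", "003", "001", "002"],
})

# Boolean mask on the result column, then keep only the case_id column.
failed_ids = output_df.loc[output_df["output_result"] == "FAIL", "case_id"].to_numpy()

# Mark the rows of input_df whose case_id is among the failed ones.
needs_fix = input_df["case_id"].isin(failed_ids)

print(failed_ids)
print(input_df[needs_fix])
```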
I cannot use output_df as input_df; they are two very different sheets. I greatly simplified their schemas here; they only share 3 columns. In reality there are more than 20,000 rows and ids.
In the end I want the original input_df, but with the new dates.
output_df
case_id output_date output_carId output_result
1 01/20/21 001 FAIL
2 02/21/21 001 SUCCESS
3 02/08/20 003 FAIL
4 01/07/20 001 FAIL
5 09/05/20 002 SUCCESS
input_df (old)
case_id input_date input_carId
1 01/20/21 001
2 02/21/21 002
3 02/08/20 003
4 01/07/20 001
5 09/05/20 002
expected result =>
input_df (new)
case_id input_date input_carId
1 01/13/21 001
2 02/21/21 002
3 02/22/20 003
4 01/28/20 001
5 09/05/20 002
Notice that the dates for the failed cases (rows 1, 3 and 4) have changed by a positive or negative multiple of 7 days.
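Putting it together, here is a rough sketch of the whole fix on the sample data, in case it helps to show the intent. The list `failed_ids` is hard-coded from the FAIL rows of output_df, and `week_offsets` is a hypothetical helper I made up that tries +1, -1, +2, -2, ... weeks in that order. Because of that order, the dates it picks (for example 01/27/21 for case 1) differ from my expected table above, which is fine since any free multiple of 7 days is acceptable.

```python
import pandas as pd

input_df = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "input_date": ["01/20/21", "02/21/21", "02/08/20", "01/07/20", "09/05/20"],
    "input_carId": ["001", "002", "003", "001", "002"],
})
failed_ids = [1, 3, 4]  # case_ids marked FAIL in output_df (hard-coded here)

# Work with real datetimes so that date arithmetic is possible.
input_df["input_date"] = pd.to_datetime(input_df["input_date"], format="%m/%d/%y")


def week_offsets():
    """Hypothetical helper: yield 1, -1, 2, -2, ... as the order of shifts to try."""
    k = 1
    while True:
        yield k
        yield -k
        k += 1


new_df = input_df.copy()
for car_id, group in input_df.groupby("input_carId"):
    used = set(group["input_date"])  # dates already taken for this car
    failing = group.loc[group["case_id"].isin(failed_ids), "input_date"]
    for idx, old_date in failing.items():
        for k in week_offsets():
            candidate = old_date + pd.Timedelta(days=7 * k)
            if candidate not in used:  # keep dates unique per input_carId
                used.add(candidate)
                new_df.loc[idx, "input_date"] = candidate
                break

print(new_df)
```

The Python loop only visits the failing rows, so I would guess it is acceptable for 20k rows, but I have not measured it. The dates can be turned back into MM/DD/YY strings with `new_df["input_date"].dt.strftime("%m/%d/%y")`.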