0

Input

Client First name Last Name Start Date End Date Amount Invoice Date
XXX John Kennedy 15-01-2021 28-02-2021 137,586.00 20-04-2021
YYY Peter Paul 7-02-2021 31-03-2021 38,750.00 20-04-2021
ZZZ Michael K 10-03-2021 29-04-2021 137,586.00 30-04-2021

Code

df = pd.read_excel ('file.xlsx',parse_dates=['Start Date','End Date'] )
df['Start Date'] = pd.to_datetime(df['Start Date'],format='%d-%m-%Y')
df['End Date'] = pd.to_datetime(df['End Date'],format='%d-%m-%Y')

df['r'] = df.apply(lambda x: pd.date_range(x['Start Date'],x['End Date']), axis=1)
df = df.explode('r')
print(df)

months = df['r'].dt.month

starts, ends = months.ne(months.groupby(level=0).shift(1)), months.ne(months.groupby(level=0).shift(-1))


df2 = pd.DataFrame({'First Name': df['First name'],
           'Start Date': df.loc[starts, 'r'].dt.strftime('%Y-%m-%d'),
                   'End Date': df.loc[ends, 'r'].dt.strftime('%Y-%m-%d'),
                   'Date Diff': df.loc[ends, 'r'].dt.strftime('%d').astype(int)-df.loc[starts, 'r'].dt.strftime('%d').astype(int)+1})

df = df.loc[~df.index.duplicated(), :]

df2 = pd.merge(df, df2, left_index=True, right_index=True)

df2['Amount'] = df['Amount'].mul(df2['Date_Diff'])
print(df['Amount'])
print (df)

df.to_excel('report.xlsx', index=True)

Error ValueError: cannot reindex from a duplicate axis

Expected output

enter image description here

how to resolve this issue?

1

1 Answer 1

1

Start with some correction in your input Excel file, namely change First name to First Name - with capital "N", just like in other columns.

Then, to read your Excel file, it is enough to run:

df = pd.read_excel('Input.xlsx', parse_dates=['Start Date', 'End Date',
   'Invoice Date'], dayfirst=True)

No need to call to_datetime.

Note also that since Invoice Date contains also dates, I added this column to parse_dates list.

Then define two functions:

  1. A function to get monthly data for the current row:

    def getMonthData(grp, amnt, dayNo):
        return pd.Series([grp.min(), grp.max(), amnt * grp.size / dayNo],
            index=['Start Date', 'End Date', 'Amount'])
    

    It converts the input Series of dates (for a single month) into the "new" content of the output rows (start / end dates and the proper share of the total amount, to be accounted for this month).

    It will be called in the following function.

  2. A function to "explode" the current row:

    def rowExpl(row):
        ind = pd.date_range(row['Start Date'], row['End Date']).to_series()
        rv = ind.groupby(pd.Grouper(freq='M')).apply(getMonthData,
            amnt=row.Amount, dayNo=ind.size).unstack().reset_index(drop=True)
        rv.insert(0, 'Client', row.Client)
        rv.insert(1, 'First Name', row['First Name'])
        rv.insert(2, 'Last Name', row['Last Name'])
        return rv.assign(**{'Invoice Date': row['Invoice Date']})
    

And the last step is to get the result. Apply rowExpl to each row and concatenate the partial results into a single output DataFrame:

result = pd.concat(df.apply(rowExpl, axis=1).values, ignore_index=True)

The result, for your data sample is:

  Client First Name Last Name Start Date   End Date   Amount Invoice Date
0    XXX       John   Kennedy 2021-01-15 2021-01-31  51976.9   2021-04-20
1    XXX       John   Kennedy 2021-02-01 2021-02-28  85609.1   2021-04-20
2    YYY      Peter      Paul 2021-02-07 2021-02-28  16084.9   2021-04-20
3    YYY      Peter      Paul 2021-03-01 2021-03-31  22665.1   2021-04-20
4    ZZZ    Michael         K 2021-03-10 2021-03-31  59350.8   2021-04-30
5    ZZZ    Michael         K 2021-04-01 2021-04-29  78235.2   2021-04-30

Don't be disaffected by seemingly too low precision of Amount column. It is only the way how Jupyter Notebook displays the DataFrame.

When you run result.iloc[0, 5], you will get:

51976.933333333334

with full, actually held precision.

Sign up to request clarification or add additional context in comments.

3 Comments

VaIdi_Bo tried to execute the function , but output is blank
If you executed result = pd.concat(…) then the actual result is in result variable. To see this result, execute e.g. print(result). Check also the file name (whether this file actually contains the expected data).
Another check: Instead of reading from an Excel file, load the data from e.g. a CSV file, then exeute my code.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.