1

i would like some help with optimizing a piece of code. First, I will provide some context. I have a DataFrame with 23k rows and 33 columns which represents reservations and another dataframe with 2k rows and 50 columns which represent properties. What I would like happenning is to understand monthly GBV (Gross Booking Value) and monthly booked nights. The tricky part is that "monthly" inclines the following:

If a reservation is from 25th of a month until the 5th of the next month and the ADR (Average Daily Rate) is 50 dollars, the result of this reservation is going to be the following:

Nights Booked in 1st month: 7 nights (from 25th to 31st) Nights Booked in 2nd month: 5 nights (from 1st to 5th) GBV for 1st month: 7 nights * 50 dollars = 350 dollars GBV for 2nd month: 5 nights * 50 dollars = 250 dollars

Sample data would be: In reservation dataframe: Arrival_date, Departure_date, Accommodation_amount In properties dataframe: GBV column for each month and Nights booked column for each month

Here is the piece of code that I am using currently:

    i=0
while i < len(kpi):
    if kpi['cod_reservation_status'][i] == 'CONF':
        dateRange = pd.date_range(kpi['arrival'][i], kpi['departure'][i], freq='d', closed='left')
        index = prop_index(int(kpi['id_room'][i]), props.id_room_type)
        nGBV = kpi['accommodation_total_amount'][i] / len(dateRange)
        for adate in dateRange:
            if adate.month == 1 and adate.year == 2022:
                props['January Booked'][index] = props['January Booked'][index] + 1
                props['January GBV'][index] = props['January GBV'][index] + nGBV
            if adate.month == 2 and adate.year == 2022:
                props['February Booked'][index] = props['February Booked'][index] + 1
                props['February GBV'][index] = props['February GBV'][index] + nGBV
            if adate.month == 3 and adate.year == 2022:
                props['March Booked'][index] = props['March Booked'][index] + 1
                props['March GBV'][index] = props['March GBV'][index] + nGBV
            if adate.month == 4 and adate.year == 2022:
                props['April Booked'][index] = props['April Booked'][index] + 1
                props['April GBV'][index] = props['April GBV'][index] + nGBV
            if adate.month == 5 and adate.year == 2022:
                props['May Booked'][index] = props['May Booked'][index] + 1
                props['May GBV'][index] = props['May GBV'][index] + nGBV
            if adate.month == 6 and adate.year == 2022:
                props['June Booked'][index] = props['June Booked'][index] + 1
                props['June GBV'][index] = props['June GBV'][index] + nGBV
            if adate.month == 7 and adate.year == 2022:
                props['July Booked'][index] = props['July Booked'][index] + 1
                props['July GBV'][index] = props['July GBV'][index] + nGBV
            if adate.month == 8 and adate.year == 2022:
                props['August Booked'][index] = props['August Booked'][index] + 1
                props['August GBV'][index] = props['August GBV'][index] + nGBV
            if adate.month == 9 and adate.year == 2022:
                props['September Booked'][index] = props['September Booked'][index] + 1
                props['September GBV'][index] = props['September GBV'][index] + nGBV
            if adate.month == 10 and adate.year == 2022:
                props['October Booked'][index] = props['October Booked'][index] + 1
                props['October GBV'][index] = props['October GBV'][index] + nGBV
            if adate.month == 11 and adate.year == 2022:
                props['November Booked'][index] = props['November Booked'][index] + 1
                props['November GBV'][index] = props['November GBV'][index] + nGBV
            if adate.month == 12 and adate.year == 2022:
                props['December Booked'][index] = props['December Booked'][index] + 1
                props['December GBV'][index] = props['December GBV'][index] + nGBV
    i+=1

That works and does the job BUT it is quite slow. On 22k entries, it takes me 2 and a half minutes. I am sure there is a way to optimize this. Any ideas are welcome. What are the best practices in optimizing loops and how would you go about this specific task?

PS: kpi = reservations DataFrame, props = properties DataFrame

2
  • 1
    Can you provide some sample data and expected results? I know your looking for optimization, but that would help understand the larger scope of the project and help with some code rework as well Commented May 19, 2022 at 16:02
  • I will provide but tomorrow morning as I have left work and my files are at my work computer. Thank you for taking the time! Commented May 19, 2022 at 18:09

3 Answers 3

2

With pandas you must not use while and for loop where it is possible to use pandas built-in function which are written in C code and compiled (quicker than interpreted).

Try to uses props.loc[:,"column"].apply("the function you want to apply")

With pandas you must think the entire dataset and not iterate objects.

Look into the documnetation for df.loc[row,col] and df.apply()

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for the answer. I will try implementing the .apply and .loc into the function and will share the results.
1

Pandas, alike Numpy, are thought for vector operations. Avoid as much as you can iterations, think of "block" equations.

Some tips:

  • Use constructs like query for fast data access, like df.query("reserved=True")
  • Besides apply() you can "explode()" the dateranges: That will create new rows where you can apply operations in parallel (adate = bookeddate)
  • ...

With Pandas/Numpy the only point where for/while loops are encouraged is when using Numba or Cython

Comments

0

I noticed that you have this:

i=0
while i < len(kpi):

A while loop is slower than a for loop. Because a while loop checks the condition after each iteration.

So you should replace it with this:

for i in range(len(kpi) - 1)

The reason you need to subtract one is because you want i to always be smaller than len(kpi) because when you use the index of something you start with 0

7 Comments

Maybe provide a link or something supporting this? I'd believe it, but "I saw a video that said x" isn't a particularly strong answer.
For a loop that does nothing, for is indeed faster than the corresponding while loop. But in OP’s code, which loop you choose makes zero difference for performance. (Of course OP should still use a for loop simply because it’s more readable.)
Should I delete my answer then?
While loops evaluate a condition. A for loop has to call its iterator's __next__ method; it's not free. In any case, you would rarely, if ever, use range(len(kpi)) over enumerate(kpi).
I wrote this few months back and I remember encountering an error when I tried doing the outer loop with for i in range(Len(kpi)) which is why I proceeded with for loop and counter. I also have other parts of my code that need reworking and optimizing but I want to take it one sip at a time.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.