When talking performance, testing is key. Here are the runtimes (on my machine, of course) of some of the solutions presented in the other answers, the original code, and a numpy-based approach (all run on the same data):
import numpy as np
from collections import Counter
import time
random_data = np.random.randint(0, 100, [100, 100000])
#### Original code
def mostActive(customers):
    unique_customers = set(customers)
    count = len(customers)
    result = []
    for customer in unique_customers:
        if customers.tolist().count(customer) / count >= .05: # had to add .tolist() to make it compatible with the data
            result.append(customer)
    return sorted(result)
start = time.time()
for i in range(100):
    _ = mostActive(random_data[i])
end = time.time()
print(f'Avg time: {(end-start)*10} ms') # would be /100*1000 -> simplified to *10
# Avg time: 1394.4847583770752 ms
#### Sorted + Counter
def mostActive(customers):
    return sorted(customer
                  for customer, count in Counter(customers).items()
                  if count / len(customers) >= .05)
start = time.time()
for i in range(100):
    _ = mostActive(random_data[i])
end = time.time()
print(f'Avg time: {(end-start)*10} ms')
# Avg time: 16.061179637908936 ms
#### Numpy
start = time.time()
for i in range(100):
    unique_elements, counts = np.unique(random_data[i], return_counts=True)
    active = sorted(unique_elements[counts >= 0.05*len(random_data[i])]) # >= to match the original's "at least 5%" threshold
end = time.time()
print(f'Avg time: {(end-start)*10} ms')
# Avg time: 3.5660386085510254 ms
Unsurprisingly, the numpy-only solution is lightning-fast, thanks to the underlying high-performance C implementation.
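For convenience, the numpy version benchmarked above can also be wrapped into a drop-in replacement for mostActive. This is just a minimal sketch of the same logic; the function name mostActiveNp and the assumption that customers is a 1-D numpy array of ids are mine:
def mostActiveNp(customers):
    # customers: assumed to be a 1-D numpy array of customer ids
    unique_elements, counts = np.unique(customers, return_counts=True)
    # keep the ids that account for at least 5% of all records
    return sorted(unique_elements[counts >= 0.05 * len(customers)])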