
I recently signed up for a scientific Python course, and it was shown in class how a numpy array can perform better than a list in some situations. For a statistical simulation, I tried the two approaches and, surprisingly, the numpy array takes a lot longer to finish the process. Could someone help me find my (possible) mistake?

My first thought was that something is probably wrong with the way the code is written, but I can't figure out what. The script calculates how many attempts, on average, someone needs to complete a collection of numbered stickers.
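For what it's worth, this is the classic coupon collector problem: with n = 20 distinct stickers, the expected number of attempts is n * (1 + 1/2 + ... + 1/n) ≈ 71.95, which matches the simulated means near 72 below.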

Python list

I used a function and no external modules.

import random as rd
import statistics as st

def collectStickers(experiments, collectible):

    obtained = []
    attempts = 0

    while(len(obtained) < collectible):

        new_sticker = rd.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1

    experiments.append(attempts)


experiments = []
collectible = 20
rep_experiment = 100000


for i in range(1, rep_experiment):
    collectStickers(experiments, collectible)

print(st.mean(experiments))

Results

The processing time seems ok for a simple experiment like this one, but for more complex purposes, 13.8 seconds is too much.

72.06983069830699
[Finished in 13.8s]

Numpy

I could not wrap the code in a function, as the following warnings showed up when I followed the same logic as above:

  • RuntimeWarning: Mean of empty slice.

  • RuntimeWarning: invalid value encountered in double_scalars
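
Both warnings can be reproduced by taking the mean of an empty array, which suggests the function version ended up calling np.mean() on an array nothing had been appended to:

import numpy as np

np.mean(np.array([]))  # RuntimeWarning: Mean of empty slice. -> returns nan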

So I just went for the naive way:

import random as rd
import numpy as np

experiments = np.array([])
rep_experiment = 100000


for i in range(1, rep_experiment):
    obtained = np.array([])
    attempts = 0
    while(len(obtained) < 20):

        new_sticker = rd.randint(1, 20)
        if new_sticker not in obtained:
            obtained = np.append(obtained, new_sticker)
        attempts += 1

    experiments = np.append(experiments, attempts)

print(np.mean(experiments))

Results

Almost 4x slower!

Is the difference in the use of a function?

72.03112031120311
[Finished in 54.2s]
3 Comments

  • @TedKleinBergman Over 100,000 iterations, that shouldn't matter very much at all.
  • @TedKleinBergman That really has nothing to do with it.
  • Direct translations of list-oriented code to numpy are usually slower. numpy is best when using its whole-array methods. Iterating through an array one by one is slow, as is building an array iteratively (np.append is not a list append clone). np.random is fast - if you ask for 1000s of random numbers at once (see the sketch below). Effective use of numpy requires looking at the problem in a different way, from the top down.
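
The last comment's point about batched random number generation can be sketched like this (using numpy's Generator API; absolute times will vary by machine):

import timeit

setup = "import numpy as np; rng = np.random.default_rng()"

# one Python call per draw: interpreter overhead is paid 100,000 times
per_call = timeit.timeit("rng.integers(1, 21)", setup=setup, number=100_000)

# one call for all draws: the loop runs in compiled code
batched = timeit.timeit("rng.integers(1, 21, size=100_000)", setup=setup, number=1)

print(f"per call: {per_call:.3f}s  batched: {batched:.3f}s")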

2 Answers


To really take advantage of the power of numpy arrays, you need to program in a numpy way. For example, try vectorizing the whole experiment, like this:

import numpy as np

def vectorized():
    rep_experiment = 100000
    collectible = 20
    # one row per experiment, one column per sticker; True = sticker collected
    targets = np.zeros((rep_experiment, collectible), dtype=bool)
    # True once an experiment has collected every sticker
    obtained = np.zeros(rep_experiment, dtype=bool)
    attempts = np.zeros(rep_experiment, dtype=int)

    # row indices, used to assign one sticker per experiment at a time
    x = np.arange(rep_experiment)

    while not obtained.all():
        # every unfinished experiment makes one more attempt
        attempts[~obtained] += 1
        # draw one random sticker per experiment, all in a single call
        r = np.random.randint(0, collectible, size=rep_experiment)
        # add the new stickers to the collected targets
        targets[x, r] = True
        # an experiment is finished once all its stickers are collected
        obtained = np.sum(targets, axis=1) == collectible

    print(attempts.mean(), attempts.std())

Check the speedup; it is around 50x for me.
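
To time it on your machine, something like this works (assuming vectorized is defined as above):

import timeit

# vectorized() prints its own mean and standard deviation on each run
print(timeit.timeit(vectorized, number=3) / 3, "seconds per run")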



np.append() copies the entire array before appending to it, so growing an array one element at a time does quadratic work.

Your program is probably spending most of its time doing those unnecessary copies in

experiments = np.append(experiments, attempts)
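
A minimal sketch of why that pattern is quadratic while list.append is not (the size n is arbitrary; timings will vary by machine):

import timeit

import numpy as np

n = 20_000

def grow_numpy():
    a = np.array([])
    for i in range(n):
        a = np.append(a, i)  # reallocates and copies the whole array each time
    return a

def grow_list():
    lst = []
    for i in range(n):
        lst.append(i)        # amortized O(1)
    return np.array(lst)     # one conversion at the end

print("np.append:", timeit.timeit(grow_numpy, number=1))
print("list     :", timeit.timeit(grow_list, number=1))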

EDIT

As expected, replacing the quadratic np.append() pattern with a preallocated array makes the numpy wrapper approximately the same speed as the list one.

Replacing the list of obtained stickers with a set makes things a little faster.

However, the bottleneck is a slow random number generator. Running through cProfile reveals that 75% of the execution time is spent in randint().
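
For reference, a profile like that can be reproduced along these lines once the functions from the listing below are defined:

import cProfile

# sort by time spent inside each function itself; randint() dominates
cProfile.run("repeat_with_list(original_collect_stickers)", sort="tottime")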

The code is below, followed by the results (on my machine).

import random
import statistics
import timeit

import numpy as np

collectible = 20
rep_experiment = 10000


def original_collect_stickers():
    obtained = []
    attempts = 0

    while len(obtained) < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.append(new_sticker)
        attempts += 1
    return attempts


def set_collect_stickers():
    obtained = set()
    attempts = 0
    n = 0

    while n < collectible:
        new_sticker = random.randint(1, collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts


def repeat_with_list(fn):
    experiments = []
    for i in range(rep_experiment):
        experiments.append(fn())
    return statistics.mean(experiments)


def repeat_with_numpy(fn):
    experiments = np.zeros(rep_experiment)
    for i in range(rep_experiment):
        experiments[i] = fn()
    return np.mean(experiments)


def time_fn(name, fn, n=3):
    time_taken = timeit.timeit(fn, number=n) / n
    result = fn()  # once more to get the result too
    print(f"{name:15}: {time_taken:.6f}, result {result}")


for wrapper in (repeat_with_list, repeat_with_numpy):
    for fn in (original_collect_stickers, set_collect_stickers):
        time_fn(f"{wrapper.__name__} {fn.__name__}", lambda: wrapper(fn))

The result is

repeat_with_list original_collect_stickers: 0.747183, result 71.7912
repeat_with_list set_collect_stickers: 0.688952, result 72.1002
repeat_with_numpy original_collect_stickers: 0.752644, result 72.0978
repeat_with_numpy set_collect_stickers: 0.685355, result 71.7515

EDIT 2

Using the fastrand library's pcg32bounded() generator, i.e. new_sticker = fastrand.pcg32bounded(collectible), makes things plenty fast:

repeat_with_list original_collect_stickers: 0.761186, result 72.0185
repeat_with_list set_collect_stickers: 0.690244, result 71.878
repeat_with_list set_collect_stickers_fastrand: 0.116410, result 71.9323
repeat_with_numpy original_collect_stickers: 0.759154, result 71.8604
repeat_with_numpy set_collect_stickers: 0.696563, result 71.5482
repeat_with_numpy set_collect_stickers_fastrand: 0.114212, result 71.6369
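
The set_collect_stickers_fastrand variant timed above is not shown in the listing; presumably it is the set version with the generator swapped out, along these lines:

import fastrand  # pip install fastrand

def set_collect_stickers_fastrand():
    obtained = set()
    attempts = 0
    n = 0

    while n < collectible:
        # pcg32bounded(k) draws from [0, k); the labels differ from
        # randint(1, collectible), but the statistics are identical
        new_sticker = fastrand.pcg32bounded(collectible)
        if new_sticker not in obtained:
            obtained.add(new_sticker)
            n += 1
        attempts += 1
    return attempts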

1 Comment

Thank you for such a complete answer, it was really helpful!
