2

I have a variable foo, which points to a string, "bar"

foo = "bar"

I have a list, called whitelist. If whitelist is not empty, the elements contained are a whitelist. If whitelist is empty, then the if statement permits any string.

I have implemented this as follows

whitelist = ["bar", "baz", "x", "y"]

if whitelist and foo in whitelist:
    print("bar is whitelisted")
    # do something with whitelisted element

if whitelist, by my understanding, checks if whitelist returns True. whitelist will be False if whitelist is empty. If whitelist contains elements, it will return True.

However, the real implementation of this contains:

  • lots of strings to check e.g. `"bar", "baz", "x", "y", "a", "b"
  • lots of whitelists to check against

Therefore, I was wondering if there is a more computationally efficient way of writing the if statement. It seems like checking the existence of whitelist each time is inefficient, and could be simplified.

5
  • Try if foo in whitelist: instead, then you are evaluating if the string is in the list. Commented Sep 12, 2022 at 5:41
  • Use a dict instead of a list of your whitelist is large. I mean, it would be, otherwise why would you care about performance. Why are you checking if whitelist exists? What value is whitelist if doesn't "exist"? I would coerce it whitelist = whitelist or [] Commented Sep 12, 2022 at 5:43
  • You could try NumPy which is computationally efficient like this np.core.defchararray.find(bar,foo)!=-1. For this, to work, you need to make a whitelist an np array. You will need to check if whitelist is empty or not like this if whitelist.size == 0: Commented Sep 12, 2022 at 5:46
  • 1
    With all things performance, measure before you fix anything, and measure your fix to ensure it helps. Copying a large array to a dict or NumPy array may be more expensive than a 1000 searches. Commented Sep 12, 2022 at 5:48
  • 1
    Also, if you do n tests, consider collecting all your foos in a set, and make whitelist a set then use foos & whitelist (intersection) to figure which foos are in your whitelist. Should be roughly len(foos) and faster then len(foos) separate lookups. Commented Sep 12, 2022 at 5:54

3 Answers 3

2

These are some ways to check whether an element is in a list or not.

from timeit import timeit
import numpy as np




whitelist1 = {"bar", "baz", "x", "y"}
whitelist2 = np.array(["bar", "baz", "x", "y"])

def func1():
    return {"foo"}.intersection(whitelist1)

def func2():
    return "foo" in whitelist1

def func3():
    return np.isin('foo',whitelist1)


def func4():
    return whitelist2[np.searchsorted(whitelist2, 'foo')] == 'foo'




print("func1=",timeit(func1,number=100000))
print("func2=",timeit(func2,number=100000))
print("func3",timeit(func3,number=100000))
print("func4=",timeit(func4,number=100000))

Time Taken by each function

func1= 0.01365450001321733
func2= 0.005112499929964542
func3 0.5342871999600902
func4= 0.17057700001168996

FOr randomly generated list

from timeit import timeit
import numpy as np
import random as rn
from string import ascii_letters


# randomLst = for a in range(500) rn.choices(ascii_letters,k=5)

randomLst = []
while len(randomLst) !=1000:
    radomWord = ''.join(rn.choices(ascii_letters,k=5))
    if radomWord not in randomLst:
        randomLst.append(radomWord)


whitelist1 = {*randomLst}
whitelist2 = np.array(randomLst)
randomWord = rn.choice(randomLst)
randomWords = set(rn.choices(randomLst, k=100))


def func1():
    return {randomWord}.intersection(whitelist1)

def func2():
    return randomWord in whitelist1

def func3():
    return np.isin('foo',whitelist1)


def func4():
    return whitelist2[np.searchsorted(whitelist2, randomWord)] == randomWord


def func5():
    return randomWords & whitelist1

print("func1=",timeit(func1,number=100000))
print("func2=",timeit(func2,number=100000))
print("func3",timeit(func3,number=100000))
print("func4=",timeit(func4,number=100000))
print("func5=",timeit(func5,number=1000)) # Here I change the number to 1000 because we check the 100 randoms word at one so number = 100000/100 = 1000.

Time taken

func1= 0.012835499946959317
func2= 0.005004600039683282
func3 0.5219665999757126
func4= 0.19900090002920479
func5= 0.0019264000002294779

Conclusion

  1. If you want to check only one word then 'in' statement is fast

  2. But, if you have a list of word then '&' statement is fast 'func5'

Note: function 5 returns a set with the words that are in the whitelist

Sign up to request clarification or add additional context in comments.

11 Comments

Good answer but it's highly sensitive to number of lookups and len(whitelist). So not sure if it really answers anything but it's a good framework for op to evaluate. You should add a case where whitelist is already a set (perhaps instead of func1?)
If you want to be fancy, randomly generating whitelist of size n, then generate m * p samples from whitelist and m * (1 -p) not in whitelist. Now you outsources most if not all of assumptions of the data. If you do it, I am curious what the two set intersects yield opposed to 100k individual lookups.
Eliminate the print as it adds noise in your measurement, so just def func1(): return {"foo"}.intersection(whitelist1).
You may still be right if you count the time to build the randomWords set (we don't know if it should or not). Thanks for being awesome with the revised code.
@AllanWind I will add your solution also I forget it.
|
1

whitelist would exist, but if it's possible None coerce with:

whitelist = whitelist or []

As shared above then you can just foo in whitelist to figure out if it's in the list. This is O(len(whitelist)) operation. Arrays are surprisingly fast (say, for at least len(whitelist) >= 1,000) in practice.

If you need it to be faster use a set, and optionally if you need to do n lookup collect your foos into a set then use intersect for O(n):

foos = { 'bar', 'none' }
whitelist = { 'bar' }
for foo in foos & whitelist:
   print(foo)

Comments

0

Here is the simplified solution, You can do that with two methods

whitelist = ["bar", "baz", "x", "y"]
foo = "bar"
# method 1
def WhiteListExists(foo, whitelist):
    if whitelist and foo in whitelist:
        return True
    else:
        return False

exists = WhiteListExists(foo,whitelist)

# method 2
exists = True if whitelist and foo in whitelist else False

Both methods do the same but the second one is fast.

2 Comments

Is there any difference between this and direct return whitelist and foo in whitelist?
The difference is 1st method is using python functions and 2nd method uses python comprehensions

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.