I am running a query that needs to exclude all previously seen items and return a fixed number of items (10 in this case). The list of already-seen items is sent in a querystring parameter as uuid4 strings separated by '&'.

With a possibly very large database, I don't think it makes sense to add the exclusion to the query itself, since most of the results won't be in the alreadySeenItems list and the database is pretty hefty.

Which of the following methods would be faster, assuming that the alreadySeenItems list can be pretty large (>1000 items in some cases)?

Note: I would normally just use the first method, but since I need an exact match and I know where each token starts, it might make sense to do otherwise (see the illustration after the code).

def getItems(request, alreadySeenItems):
    newItems = []
    allPossibleItems = Items.objects.raw('SELECT "id" FROM items_table WHERE conditions;')

    # method 1: membership test against the raw '&'-joined string
    # (this is a substring check, not a per-id comparison)
    for item in allPossibleItems:
        if item.id not in alreadySeenItems:
            newItems.append(item.id)
            if len(newItems) == 10:
                return newItems

    # method 2: linear scan of the split-out list for every candidate
    alreadySeenItemsList = alreadySeenItems.split('&')
    for item in allPossibleItems:
        if not checkForItemInList(item.id, alreadySeenItemsList):
            newItems.append(item.id)
            if len(newItems) == 10:
                return newItems

    # method 3: membership test against a set built from the list
    alreadySeenItemsSet = set(alreadySeenItems.split('&'))
    for item in allPossibleItems:
        if item.id not in alreadySeenItemsSet:
            newItems.append(item.id)
            if len(newItems) == 10:
                return newItems

def checkForItemInList(item, items):
    for tmp in items:
        if item == tmp:
            return True
    return False
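
To illustrate the exact-match concern with method 1: plain `in` against the '&'-joined string is a substring test, so it can match partial ids. (With fixed-length uuid4 tokens this happens to be safe in practice, but it's fragile.) A quick sketch, using made-up short ids rather than real uuids:

# Hypothetical short ids, just to show the substring pitfall of method 1.
seen = 'abc123&def456'

print('c123' in seen)                  # True: substring hit, but 'c123' was never seen
print('c123' in seen.split('&'))       # False: exact per-token comparison
print('c123' in set(seen.split('&')))  # False, and O(1) on average per lookup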
  • Why not use timeit on some representative data and find out?

1 Answer

Whatever you are doing will be much faster if you use Python's built-in set data structure. Checking a set for membership is very fast at the cost of some extra memory, and 1k items is nothing. See https://docs.python.org/2/library/stdtypes.html#set

Not to dodge the original question, but we also can't be sure your database is truly large. Have you added indexes to your table? As for the worry that most of the results are of one kind or another and that this causes inefficiency: some systems (PostgreSQL at least) allow "partial indexing", which lets you build an index over only the rows matching a condition and can make things a little leaner - but I'm not completely sure whether that's relevant to you.
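
For illustration only, here is what a partial index might look like, wrapped in a Django migration since you're using Django. The available column and the index name are hypothetical stand-ins for whatever your real conditions are (PostgreSQL syntax):

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = []  # fill in your app's latest migration here

    operations = [
        # Only rows matching the WHERE predicate are indexed, so the
        # index stays small and targets exactly the subset the query
        # cares about.
        migrations.RunSQL(
            sql='CREATE INDEX items_available_idx ON items_table (id) WHERE available;',
            reverse_sql='DROP INDEX items_available_idx;',
        ),
    ]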

Anyway, here is an example:

import random

num_elements = 10000

def checkForItemInList(item, items):
    # Linear scan, same as the question's helper.
    for tmp in items:
        if item == tmp:
            return True
    return False

def run_experiment(values, use_set):
    for i in range(num_elements):
        # We know this will be in there...
        random_elem = random.randint(0, num_elements - 1)
        # The result of the lookup is discarded; we only pay for its cost.
        if use_set:
            random_elem in values
        else:
            checkForItemInList(random_elem, values)


values = range(num_elements)
values_as_set = set(values)
# Swap the comment between the next two lines to time the set version.
#run_experiment(values_as_set, True)
run_experiment(values, False)

Using the use_set=True version (commented out) takes about 0.05 seconds on average for me, while the list version takes about 2.5 seconds.
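
If you want to reproduce that kind of measurement, a minimal timeit harness along these lines works. The module name experiment is made up here; it assumes the snippet above was saved as experiment.py with the two module-level run_experiment calls commented out:

import timeit

# Hypothetical harness: assumes the snippet above lives in experiment.py
# with its module-level run_experiment(...) calls commented out, so the
# import itself does no work.
setup = 'from experiment import run_experiment, values, values_as_set'
print(timeit.timeit('run_experiment(values_as_set, True)', setup=setup, number=1))
print(timeit.timeit('run_experiment(values, False)', setup=setup, number=1))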


5 Comments

The database is currently 50,000 rows, but it's still in beta and I'm expecting it to grow well into the millions of rows. The SQL indexing wouldn't work in this situation, because I would need to compare every item in the alreadySeenList to each row in the query - unless I'm misunderstanding how that SQL query works.
Sorry, I'm having a little trouble understanding your example. Could you maybe post a less abstract example of your problem, perhaps about your real subject matter or cat pictures or blog posts?
I uploaded the actual piece of code I use, without the conditions, because those aren't relevant for this example. This is a big database of items that a user may or may not have seen, and I need to grab a small set of ones that they haven't seen that also conform to some conditions.
Assuming good indexes, even with millions of rows: if your big master table of items has a column like item_id, and you have a user_views table with user_id and item_id_seen columns, then something like SELECT i.item_id FROM items i LEFT JOIN user_views uv ON uv.item_id_seen = i.item_id AND uv.user_id = myuser WHERE uv.item_id_seen IS NULL LIMIT 10 should find 10 new items. If the number of views per user is small, or can be limited by time, then SELECT i.item_id FROM items i WHERE i.item_id NOT IN (SELECT item_id_seen FROM user_views WHERE user_id = myuser) LIMIT 10 could work.
Also, if you have tons of potential items, and users tend to never see more than a few percent of them, and you just want 10 random unseen items, you have good odds of just selecting, say, 1k IDs from your potential ids, slurping your user's prior views into memory, and checking each of the 1k candidates. The odds that you won't get 10 new ones from your batch of 1k would be very small...
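
To put a rough number on "very small" in that last comment, here is a sketch under assumptions that are entirely made up for illustration: a batch of 1,000 candidate ids, a user who has already seen 5% of the catalog, and candidates treated as independent draws:

from math import comb, log10  # math.comb needs Python 3.8+

n, p_seen = 1000, 0.05   # assumed batch size and fraction of catalog already seen
p_unseen = 1 - p_seen

# log10 of each binomial term, to avoid float underflow; the chance of
# finding fewer than 10 unseen items is the sum of the terms for k = 0..9.
def log10_term(k):
    return log10(comb(n, k)) + k * log10(p_unseen) + (n - k) * log10(p_seen)

# The k = 9 term dominates at roughly 1e-1268, i.e. under these assumptions
# the probability of failing to find 10 unseen items is effectively zero.
print(max(log10_term(k) for k in range(10)))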
