I am running a query that needs to exclude all previously seen items and return a fixed number of items (10 in this case). The list of already-seen items is sent in a querystring parameter as uuid4 strings separated by '&'.

With a possibly very large database, I don't think it makes sense to add the exclusion to the query itself, since most of the results won't be in the alreadySeenItems list and the database is pretty hefty.

Which of the following methods would be faster, assuming that the alreadySeenItems list can be pretty large (>1000 items in some cases)?

Note: I would normally just use the first method, but since I need an exact match and I know where each token starts, it might make sense to do otherwise (see the illustration after the code).

def getItems(request, alreadySeenItems):
    newItems = []
    allPossibleItems = Items.objects.raw('SELECT "id" FROM items_table WHERE conditions;')

    # method 1: membership test against the raw '&'-joined string
    # (this is a substring check, not a per-id comparison)
    for item in allPossibleItems:
        if item.id not in alreadySeenItems:
            newItems.append(item.id)
            if len(newItems) == 10:
                return newItems

    # method 2: linear scan of the split-out list for every candidate
    alreadySeenItemsList = alreadySeenItems.split('&')
    for item in allPossibleItems:
        if not checkForItemInList(item.id, alreadySeenItemsList):
            newItems.append(item.id)
            if len(newItems) == 10:
                return newItems

    # method 3: membership test against a set built from the list
    alreadySeenItemsSet = set(alreadySeenItems.split('&'))
    for item in allPossibleItems:
        if item.id not in alreadySeenItemsSet:
            newItems.append(item.id)
            if len(newItems) == 10:
                return newItems

def checkForItemInList(item, items):
    for tmp in items:
        if item == tmp:
            return True
    return False
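
To illustrate the exact-match concern with method 1: plain `in` against the '&'-joined string is a substring test, so it can match partial ids. (With fixed-length uuid4 tokens this happens to be safe in practice, but it's fragile.) A quick sketch, using made-up short ids rather than real uuids:

# Hypothetical short ids, just to show the substring pitfall of method 1.
seen = 'abc123&def456'

print('c123' in seen)                  # True: substring hit, but 'c123' was never seen
print('c123' in seen.split('&'))       # False: exact per-token comparison
print('c123' in set(seen.split('&')))  # False, and O(1) on average per lookup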
  • Why not use timeit on some representative data and find out?

1 Answer

Whatever you are doing will be much faster if you use Python's built-in set data structure. Checking a set for membership is very fast at the cost of some extra memory, and 1k items is nothing. See https://docs.python.org/2/library/stdtypes.html#set

Not to dodge the original question, but we also can't be sure your database is truly large. Have you added indexes to your table? As for the worry that most of the results are of one kind or another and that this causes inefficiency: some systems (PostgreSQL at least) allow "partial indexing", which lets you build an index over only the rows matching a condition and can make things a little leaner - but I'm not completely sure whether that's relevant to you.
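
For illustration only, here is what a partial index might look like, wrapped in a Django migration since you're using Django. The available column and the index name are hypothetical stand-ins for whatever your real conditions are (PostgreSQL syntax):

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = []  # fill in your app's latest migration here

    operations = [
        # Only rows matching the WHERE predicate are indexed, so the
        # index stays small and targets exactly the subset the query
        # cares about.
        migrations.RunSQL(
            sql='CREATE INDEX items_available_idx ON items_table (id) WHERE available;',
            reverse_sql='DROP INDEX items_available_idx;',
        ),
    ]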

Anyway, here is an example:

import random

num_elements = 10000

def checkForItemInList(item, items):
    # Linear scan, same as the question's helper.
    for tmp in items:
        if item == tmp:
            return True
    return False

def run_experiment(values, use_set):
    for i in range(num_elements):
        # We know this will be in there...
        random_elem = random.randint(0, num_elements - 1)
        # The result of the lookup is discarded; we only pay for its cost.
        if use_set:
            random_elem in values
        else:
            checkForItemInList(random_elem, values)


values = range(num_elements)
values_as_set = set(values)
# Swap the comment between the next two lines to time the set version.
#run_experiment(values_as_set, True)
run_experiment(values, False)

Using the use_set=True version (commented out) takes about 0.05 seconds on average for me, while the list version takes about 2.5 seconds.
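
If you want to reproduce that kind of measurement, a minimal timeit harness along these lines works. The module name experiment is made up here; it assumes the snippet above was saved as experiment.py with the two module-level run_experiment calls commented out:

import timeit

# Hypothetical harness: assumes the snippet above lives in experiment.py
# with its module-level run_experiment(...) calls commented out, so the
# import itself does no work.
setup = 'from experiment import run_experiment, values, values_as_set'
print(timeit.timeit('run_experiment(values_as_set, True)', setup=setup, number=1))
print(timeit.timeit('run_experiment(values, False)', setup=setup, number=1))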


5 Comments

The database is currently 50,000 rows, but it's still in beta and I'm expecting it to grow well into the millions of rows. The SQL indexing wouldn't work in this situation, because I would need to compare every item in the alreadySeenList to each row in the query - unless I'm misunderstanding how that SQL query works.
Sorry, I'm having a little trouble understanding your example. Could you maybe post a less abstract example of your problem, perhaps about your real subject matter or cat pictures or blog posts?
I uploaded the actual piece of code I use, without the conditions, because those aren't relevant for this example. This is a big database of items that a user may or may not have seen, and I need to grab a small set of ones that they haven't seen that also conform to some conditions.
Assuming good indexes, even with millions of rows: if your big master table of items has a column like item_id, and you have a user_views table with user_id and item_id_seen columns, then something like SELECT i.item_id FROM items i LEFT JOIN user_views uv ON uv.item_id_seen = i.item_id AND uv.user_id = myuser WHERE uv.item_id_seen IS NULL LIMIT 10 should find 10 new items. If the number of views per user is small, or can be limited by time, then SELECT i.item_id FROM items i WHERE i.item_id NOT IN (SELECT item_id_seen FROM user_views WHERE user_id = myuser) LIMIT 10 could work.
Also, if you have tons of potential items, and users tend to never see more than a few percent of them, and you just want 10 random unseen items, you have good odds of just selecting, say, 1k IDs from your potential ids, slurping your user's prior views into memory, and checking each of the 1k candidates. The odds that you won't get 10 new ones from your batch of 1k would be very small...
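
To put a rough number on "very small" in that last comment, here is a sketch under assumptions that are entirely made up for illustration: a batch of 1,000 candidate ids, a user who has already seen 5% of the catalog, and candidates treated as independent draws:

from math import comb, log10  # math.comb needs Python 3.8+

n, p_seen = 1000, 0.05   # assumed batch size and fraction of catalog already seen
p_unseen = 1 - p_seen

# log10 of each binomial term, to avoid float underflow; the chance of
# finding fewer than 10 unseen items is the sum of the terms for k = 0..9.
def log10_term(k):
    return log10(comb(n, k)) + k * log10(p_unseen) + (n - k) * log10(p_seen)

# The k = 9 term dominates at roughly 1e-1268, i.e. under these assumptions
# the probability of failing to find 10 unseen items is effectively zero.
print(max(log10_term(k) for k in range(10)))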
