2

What is the most efficient way to keep track of data in a ascending/descending order. Lets say I have a stream of data, assume this is very large. Sample stream:

key,mod,value
5,add,1
2,add,3
4,add,2
2,add,2
2,rem,5

As I am reading the stream, I place it in a dictionary to keep track of the contents. For example, at the end of the mini stream above, I will have a dictionary with {5:1, 4:2}. Where add means the value is increased by the value, and rem means you are removing that much from that key. If the value goes to 0, then delete that key from the dictionary. But I also want to be able to print the data in order (but doesnt have to be all the time.) I do want to keep track of the highest/lowest keys, so that I know when the highest/lowest value has changed. Either the key changed or its value changed.

The way I am doing it right now is fill/delete keys from dictionary accordingly. This should be constant O(1). Keep track of a sorted_keys list, where each stream checks if the new number is in the dictionary, if not, will do bisect.insort_right(sorted_keys, key). So the sorted_keys is always sorted at all times. Assuming adding 1 value in a sorted list is quick, though it does need to extend the size so this may take O(n) still. And i keep track of prev_highest or prev_lowest, and check this against sorted_keys[0] or sorted_keys[-1] respectively.

I tried using deque with bisect.insort_right, SortedDict from sortedcontainers, linked list, OrderedDict, but it seems that the above seems to work best. Is there another potential implementation that could be more optimized? Or should I just keep track of a certain level in order, say 10 items in order. And update that accordingly. But the problem with this is that if there is a new key, how do i know its one of the new top or not? Seems like having a heapq would help, but I cant get the sorted values until i pop them. And if I need to print the whole thing in order, I just sort the keys of the entire dictionary.

Edit: Adding my tests using bisect and SortedDict below:

import timeit
import bisect
import random
from sortedcontainers import SortedDict

NUM_ITERATION_TEST = 10
TOTAL_NUM_DATA = 1000000
MODS = ['add', 'rem']
QUANTITY = [1, 5, 10, 20, 100, 200, 300, 500, 1000]

DATA = [{'mod': random.choice(MODS),
         'key': random.randint(0, 1000),
         'val': random.choice(QUANTITY)} for x in range(TOTAL_NUM_DATA)]


def method1(DATA):
    d = {}
    sorted_keys = []

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
                bisect.insort_right(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)           
                else:
                    d[key] -= data['val']
            else:
                pass # Deleting something not there yet

def method2(DATA):
    d = SortedDict()

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet


if __name__ == "__main__":
    # METHOD 1
    print("Method 1 Execution Time:")
    print(timeit.timeit("test_timeit.method1(test_timeit.DATA)",
                        number=NUM_ITERATION_TEST,
                        setup="import test_timeit"))

    # METHOD 2
    print("Method 2 Execution Time:")
    print(timeit.timeit("test_timeit.method2(test_timeit.DATA)",
                        number=NUM_ITERATION_TEST,
                        setup="import test_timeit"))

The results for the above is:

Method 1 Execution Time:
4.427699800000001
Method 2 Execution Time:
12.7445671
12
  • It is hard to give a general answer to the sorting problem. Sorting is one of larger problems in computer science and many algorithms have been created. Just to name one concept to store data in an ordered fashion: b-trees. Commented Sep 1, 2020 at 1:08
  • 2
    Sorry, but the sample input appears to be gibberish - what does "add" mean" What does "rem" mean? ;How on Earth do you get {5:1, 4:2} out of those 6 lines? Is the "key,mod,value" line part of the input, or is it meant to be commentary of some kind? I've read this 4 times now, and have no real idea what you're trying to do. Commented Sep 1, 2020 at 1:10
  • If the dict is large (I know that this is vague) an sqlite database with an index may also be an option (maybe hold in memory). The indexes are B-trees and automatically kept up to date by the database. Commented Sep 1, 2020 at 1:11
  • 1
    @TimPeters I updated the question. Add means add that value to that key, and rem means remove that value from that key. The headers is more to just describe each columns Commented Sep 1, 2020 at 1:18
  • 1
    Regardless of dict flavor, never test for membership with k in thedict.keys(). That requires building a keys object every time, which is pure wasted time and space. Do k in thedict instead, which directly calls the primitive thedict.__contains__(k) method. Commented Sep 1, 2020 at 2:42

2 Answers 2

3

For data that fits in memory, "SortedDict from sortedcontainers" (which you already tried) is generally as good as it gets for keeping such a dict in sorted order. But lookup time is (roughly) O(log N) (see edit at the end - that appears to be false!).

Assuming adding 1 value in a sorted list is quick, though it does need to extend the size so this may take O(n) still.

In a Python list L, inserting an element at index i has to - at a minimum - physically move len(L) - i pointers, which means 8x times that many bytes on a 64-bit box. That's where sortedcontainers gets its huge wins when data gets "large": the worst-case number of pointers it needs to physically move is bounded by a constant independent of len(L). Until len(L) gets into the thousands, it can be hard to notice the difference. But when len(L) gets into the millions, the difference is huge.

I'd try a compromise: use a sortedcontainers SortedList to keep track of the current keys, and a plain Python dict for the actual dict. Then:

For "key add value": see if the key is in the dict. Very fast. If it is, no need to touch the SortedList. Just mutate the dict. If the key is not in the dict, then it needs to be added to the SortedList and the dict.

For "key rem value": see if the key in the dict. If it's not, I have no idea what you want to do, but you'll figure it out ;-) But if it is in the dict, subtract the value. If the result is non-zero you're done. Else (result is 0), remove the key from the dict and the SortedList.

Note: I'm suggesting SortedList instead of SortedSet not for semantic reasons, but because SortedSet requires more memory, to maintain a set in parallel with a sorted list. You have no use for the set.

What you may really want in addition to a dict is a double-ended ("min max") heap. It's impossible to guess from what you've said - it depends on, e.g., how often you only want to know "the smallest and/or the largest" compared to often you want to materialize the entire sorted order. But I don't know of a min-max heap implementation for Python that's built for speed - they're messy beasts to code, and rarely used.

EDIT

On second thought, it seems that a sortedcontainer's SortedDict already combines a SortedList with (a subclass of) a plain Python dict. For example, setting a value in a SortedDict is implemented like so:

def __setitem__(self, key, value):
    if key not in self:
        self._list_add(key)
    dict.__setitem__(self, key, value)

So it only touches the SortedList if the key isn't already in the dict. There's not much opportunity for improving on this if you maintain your own <SortedList, dict> pair.

ROLLING YOUR OWN

Here's another to try:

def method3(DATA):
    sorted_keys = SortedList()
    d = {}

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d:
                d[key] += data['val']
            else:
                d[key] = data['val']
                sorted_keys.add(key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d:
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet

This implements my original suggestion: maintain your own pair of a SortedList with a plain Python dict. It has the same O() behavior as using a SortedDict, but appears significantly faster by a constant factor. The seems partly because the dict operations are all coded in C now (SortedDict is coded in Python), and the rest because we only test for dict membership once per data item. For example, in

if key in d:
    d[key] += data['val']

when d is a SortedDict, key in d tests it once explicitly, but the implementation of d.__setitem__() has to test it again, so that it can add key to its hidden SortedList if the key is unknown. From our higher-level view, we already know the key is in the dict in the body of the if, so can ignore our explicit SortedList entirely there.

Sign up to request clarification or add additional context in comments.

8 Comments

Yea i used SortedDict, thinking it'll be better. Unfortunately, it is not. Let me update my question with my profile testing. Unless I messed up something there
The closest you got quantifying your problem size is "very large" - but nobody knows what that means to you ;-)
True, I guess thats vague. But seems no matter what value I put in the test above, the bisect implementation will still win. Please let me know if you see something different
Woohoo, I've beaten that Tim Peters in something sorting! I can retire now :-)
@superb rain, heh - read the comments on the original post all the way through. The original problem instance is nearly trivial, because its random.randint(0, 1000) means there are at most 1001 keys in the dict. Nothing beats a plain list for something so tiny. But changing 1000 to something even as small as 40000 makes it crawl compared to what I posted.
|
3

You made two mistakes in your method1:

  • You check if key in d.keys(): instead of if key in d:. There's no point creating a keys view there.

  • You remove from the list with sorted_keys.remove(key) instead of using bisect to find the index and then deleting that index.

Fixing those, storing some methods in local variables for shorter/faster access, and using d.get to possibly find a key and get its value (instead of an in check followed by looking up the value), I get these times (method1/2 are yours, method3 is Tim's, method4 is mine):

Round 1
method1 7.590627200000004
method2 19.851634099999984
method3 6.093115100000006
method4 5.069753999999989

Round 2
method1 7.857367500000009
method2 19.59779759999998
method3 6.057990299999972
method4 5.0046839999999975

Round 3
method1 7.843560700000012
method2 19.8673627
method3 6.079332300000033
method4 5.073929300000032

Tim challenged me to change randint(0, 1000) to randint(0, 40_000):

method1 607.2835661000001
method2 26.667593300000135
method3 12.84969140000021
method4 16.68231250000008

And randint(0, 400_000) (only the faster solutions):

method3 20.179627500000002
method4 115.39424580000002

My version:

def method4(DATA):
    d = {}
    sorted_keys = []
    insort = bisect.insort_right
    index = bisect.bisect_left
    get = d.get
    
    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            val = get(key)
            if val:
                d[key] = val + data['val']
            else:
                d[key] = data['val']
                insort(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            val = get(key)
            if val:
                if val <= data['val']:
                    del d[key]
                    del sorted_keys[index(sorted_keys, key)]
                else:
                    d[key] = val - data['val']
            else:
                pass # Deleting something not there yet

Full benchmark code including correctness checking:

import timeit
import bisect
import random
from sortedcontainers import SortedDict, SortedList

NUM_ITERATION_TEST = 10
TOTAL_NUM_DATA = 1000000
MODS = ['add', 'rem']
QUANTITY = [1, 5, 10, 20, 100, 200, 300, 500, 1000]

DATA = [{'mod': random.choice(MODS),
         'key': random.randint(0, 1000),
         'val': random.choice(QUANTITY)} for x in range(TOTAL_NUM_DATA)]

def method1(DATA, return_=False):
    d = {}
    sorted_keys = []

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
                bisect.insort_right(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)           
                else:
                    d[key] -= data['val']
            else:
                pass # Deleting something not there yet
    if return_:
        return d, sorted_keys

def method2(DATA, return_=False):
    d = SortedDict()

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d.keys():
                d[key] += data['val']
            else:
                d[key] = data['val']
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d.keys():
                if d[key] <= data['val']:
                    del d[key]
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet
    if return_:
        return dict(d), list(d)

def method3(DATA, return_=False):
    sorted_keys = SortedList()
    d = {}

    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            if key in d:
                d[key] += data['val']
            else:
                d[key] = data['val']
                sorted_keys.add(key)
        elif data['mod'] == 'rem':
            key = data['key']
            if key in d:
                if d[key] <= data['val']:
                    del d[key]
                    sorted_keys.remove(key)
                else:
                    d[key] -= data['val']
            else:
                pass  # Deleting something not there yet
    if return_:
        return d, list(sorted_keys)

def method4(DATA, return_=False):
    d = {}
    sorted_keys = []
    insort = bisect.insort_right
    index = bisect.bisect_left
    get = d.get
    
    for data in DATA:
        if data['mod'] == 'add':
            key = data['key']
            val = get(key)
            if val:
                d[key] = val + data['val']
            else:
                d[key] = data['val']
                insort(sorted_keys, key)
        elif data['mod'] == 'rem':
            key = data['key']
            val = get(key)
            if val:
                if val <= data['val']:
                    del d[key]
                    del sorted_keys[index(sorted_keys, key)]
                else:
                    d[key] = val - data['val']
            else:
                pass # Deleting something not there yet
    if return_:
        return d, sorted_keys

methods = method1, method2, method3, method4

expect = method1(DATA, True)
for method in methods[1:]:
    print(method(DATA, True) == expect)

if __name__ == "__main__":
    for round in range(1, 4):
        print('Round', round)
        for method in methods:
            t = min(timeit.repeat(lambda: method(DATA),
                                  number=NUM_ITERATION_TEST))
            print(method.__name__, t)
        print()

5 Comments

Now try 400000 ;-) The difference starts to get dramatic then. The quadratic-time behavior in the original was dominated by .remove(), but it's still there in method4 due to the insort() and del d[key] lines - but those run "at C speed" so it takes larger lists for those lines to utterly overwhelm everything else the program is doing.
Nah, why would I do that? I already know just as much as you do that it'll get that much slower for that reason.
Sorry, but nothing in your answer, or your comments, suggested you had any idea why I suggested SortedList to begin with. To the contrary, your answer suggested continuing to use a plain list, with changes that didn't alter O() behavior at all. But I up-voted your answer anyway ;-)
@TimPeters Right, sorry. I guess I consider it common knowledge, and my comment was supposed to be lighthearted but ended up being flippant. I do know more than I've shown, might even know something about Timsort that you don't :-). When I said I've beaten you at something sorting, I just meant the OP's benchmark as is. I did 400,000 after all now, as I wasn't able to estimate the effect on the interplay with the other parameters and the whole benchmark. I mean, it also matters how often we insert or remove. I'd say now my answer is clear that mine is significantly slower at large key sets.
@superbrain Definitely will keep your answer in mind. Definitely a great alternative (if <40k). Good idea using index when removing for sure

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.