What is the most efficient way to keep track of data in a ascending/descending order. Lets say I have a stream of data, assume this is very large. Sample stream:
key,mod,value
5,add,1
2,add,3
4,add,2
2,add,2
2,rem,5
As I am reading the stream, I place it in a dictionary to keep track of the contents. For example, at the end of the mini stream above, I will have a dictionary with {5:1, 4:2}. Where add means the value is increased by the value, and rem means you are removing that much from that key. If the value goes to 0, then delete that key from the dictionary. But I also want to be able to print the data in order (but doesnt have to be all the time.) I do want to keep track of the highest/lowest keys, so that I know when the highest/lowest value has changed. Either the key changed or its value changed.
The way I am doing it right now is fill/delete keys from dictionary accordingly. This should be constant O(1). Keep track of a sorted_keys list, where each stream checks if the new number is in the dictionary, if not, will do bisect.insort_right(sorted_keys, key). So the sorted_keys is always sorted at all times. Assuming adding 1 value in a sorted list is quick, though it does need to extend the size so this may take O(n) still. And i keep track of prev_highest or prev_lowest, and check this against sorted_keys[0] or sorted_keys[-1] respectively.
I tried using deque with bisect.insort_right, SortedDict from sortedcontainers, linked list, OrderedDict, but it seems that the above seems to work best. Is there another potential implementation that could be more optimized? Or should I just keep track of a certain level in order, say 10 items in order. And update that accordingly. But the problem with this is that if there is a new key, how do i know its one of the new top or not? Seems like having a heapq would help, but I cant get the sorted values until i pop them. And if I need to print the whole thing in order, I just sort the keys of the entire dictionary.
Edit: Adding my tests using bisect and SortedDict below:
import timeit
import bisect
import random
from sortedcontainers import SortedDict
NUM_ITERATION_TEST = 10
TOTAL_NUM_DATA = 1000000
MODS = ['add', 'rem']
QUANTITY = [1, 5, 10, 20, 100, 200, 300, 500, 1000]
DATA = [{'mod': random.choice(MODS),
'key': random.randint(0, 1000),
'val': random.choice(QUANTITY)} for x in range(TOTAL_NUM_DATA)]
def method1(DATA):
d = {}
sorted_keys = []
for data in DATA:
if data['mod'] == 'add':
key = data['key']
if key in d.keys():
d[key] += data['val']
else:
d[key] = data['val']
bisect.insort_right(sorted_keys, key)
elif data['mod'] == 'rem':
key = data['key']
if key in d.keys():
if d[key] <= data['val']:
del d[key]
sorted_keys.remove(key)
else:
d[key] -= data['val']
else:
pass # Deleting something not there yet
def method2(DATA):
d = SortedDict()
for data in DATA:
if data['mod'] == 'add':
key = data['key']
if key in d.keys():
d[key] += data['val']
else:
d[key] = data['val']
elif data['mod'] == 'rem':
key = data['key']
if key in d.keys():
if d[key] <= data['val']:
del d[key]
else:
d[key] -= data['val']
else:
pass # Deleting something not there yet
if __name__ == "__main__":
# METHOD 1
print("Method 1 Execution Time:")
print(timeit.timeit("test_timeit.method1(test_timeit.DATA)",
number=NUM_ITERATION_TEST,
setup="import test_timeit"))
# METHOD 2
print("Method 2 Execution Time:")
print(timeit.timeit("test_timeit.method2(test_timeit.DATA)",
number=NUM_ITERATION_TEST,
setup="import test_timeit"))
The results for the above is:
Method 1 Execution Time:
4.427699800000001
Method 2 Execution Time:
12.7445671
{5:1, 4:2}out of those 6 lines? Is the "key,mod,value" line part of the input, or is it meant to be commentary of some kind? I've read this 4 times now, and have no real idea what you're trying to do.k in thedict.keys(). That requires building a keys object every time, which is pure wasted time and space. Dok in thedictinstead, which directly calls the primitivethedict.__contains__(k)method.