Efficient data structure keeping objects sorted on multiple keys

Question

I have a python program where I am using a priority queue to keep track of objects to be processed. At the moment the queue is implemented using a SortedList, which is working very well.

However, I need to extend this code, so that the list is kept sorted on multiple keys. Somewhat like an SQL database with indexes on multiple columns, so that I can efficiently access, add and delete objects on all keys. My workload is add/delete heavy. To give an idea of what I want to do, here is some pseudo code:

ml = MultiSortedList()
ml.append((1, "z", 1.5), a)
ml.append((2, "a", 40.0), b)
ml.append((3, "f", 0.5), c)

print(ml.sorted(0))
   [((1, "z", 1.5), a),
   ((2, "a", 40.0), b),
   ((3, "f", 0.5), c),]

print(ml.sorted(2))
   [((3, "f", 0.5), c),
   ((1, "z", 1.5), a),
   ((2, "a", 40.0), b)]

print(ml.sorted(2).pop(1)
   (1, "z", 1.5), a)

print(ml.sorted(0))
   [((2, "a", 40.0), b),
   ((3, "f", 0.5), c)]

I can't quite figure out how to do this efficiently. Of course I could sort the list again for each access to a different column, but that is simply too expensive. Also, the O(n) delete operations on python lists become painful, as the list may contain several thousands of objects.

Is there an existing data structure (preferably already implemented in python) which solves this problem? If not, can you help me with an outline for how to implement this efficiently?

How many columns/ different fields are there ? Is it a fixed number ? — advocateofnone
– advocateofnone, Commented Mar 18, 2015 at 9:55
if you're looking for SQL-style functionality within python, have you looked at the pandas package - pandas.pydata.org ? — John Greenall
– John Greenall, Commented Mar 18, 2015 at 11:03
@sasha: About 4-5 columns. I can live with rewriting the application if this changes. Problem with doing it in SQL is simply overhead. This is a datastructure that is going to be accessed a lot, so I think implementing it as adding/dropping rows in a database (and serializing objects) will have too high overhead. — pehrs
– pehrs, Commented Mar 18, 2015 at 11:54

loopbackbee · Accepted Answer · 2015-03-18 11:33:15Z

6

Instead of using a sorted list as the implementation of your priority queue, you should be using a heap. A binary heap, for example, has log(n) insertion and deletion. A Fibonacci heap will have O(1) insertion.

You have a implementation inside the standard library: the heapq module.

For your use case, you'll need to keep multiple heaps: one for each sorting order. For clarity of implementation, you may want to hold your original data on a dictionary (possibly using random or incrementing keys) and only keep its keys on the heaps.

Using this technique, you can easily get O(1) insertion and O(log(n)) deletion. You don't really specify what kind of accesses you'll be having, but if you need random access, you can use binary search on the appropriate heap to get O(log(n))

edited Mar 18, 2015 at 11:33

answered Mar 18, 2015 at 11:10

loopbackbee

23.6k11 gold badges69 silver badges102 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

pehrs Over a year ago

I ended up using a solution similar to this with a wrapper which inserts and deletes objects from all the heaps as needed. Thanks for the help!

Arthur Vaïsse · Accepted Answer · 2015-03-18 13:41:25Z

One approach is to maintain n list internally, one for each sorting order, and to add/remove item for each of them. This way, you "only" multiply the manipulation time of your data structure by a constant value (i.e. if using 3 keys instead of one, then got 3 log(n) instead of log(n) to delete/insert an element).

The concept I imagine behind such implementation is the java's Comparator one.

Create for each key a comparator method to use to sort them, and then use it at insert and deleting time.

It would work as follow :

class SortedList(list):
    def __init__(self, comparator = None):
        list.__init__(self)

        self.comparator = comparator

    def add(self, element):
        """ Adds the element into the sorted list.
        If no comparator where provided at initialisation, then try to compare
        the element to item already stored in the list using standard __gt__ 
        and __eq__.

        Argument :
            element : the element to insert in the list

        /!\ NB : A more inteligent implementation should be used, such as binary search... /!\
        """
        index = 0
        while (index < len(self)):
            if self.isGreatterThan(element, self[index]):
                index += 1
            else:
                break

        self.insert(index, element)

    def remove(self, element):
        """ Same as previous, a better approach is possible that use binary search """
        list.remove(self, element)


    def isGreatterThan(self, element, otherElement):
        """ Compare if element is greater than other element. """
        if self.comparator != None:
            return self.comparator(element, otherElement)
        else:
            return element.__gt__(otherElement)


class MultipleKeysObjectContainer():
    def __init__(self, comparators):
        #register the comparators
        self.comparators = comparators
        #create a List for each comparator
        self.data = {}
        for key in self.comparators.keys():
            self.data[key] = SortedList(comparator = self.comparators[key])

    def add(self, element):
        for key in self.data.keys():
            self.data[key].add(element)

    def __repr__(self):
        return "<MultipleKeysObjectContainer :"+self.data.__repr__()+">"

    def __str__(self):
        result = "MultipleKeysObjectContainer{\n"
        for key in self.data.keys():
            result += "\tOrder by : "+key+"{\n"
            for element in self.data[key]:
                result += "\t\t" + str(element) + "\n"
            result += "\t}\n"
        result += "}"
        return result

    def popOrderedBy(self, key, position):
        """ pop the item a the position in the list of item ordered by the key.
        Remove also from other data containers. """
        item = self.data[key].pop(position)
        for dataKey in self.data.keys():
            if dataKey != key:
                self.data[dataKey].remove(item)

        return item



if __name__ == "__main__":
    a = SortedList(lambda x,y : x[0][0] > y[0][0])

    item1 = ((1, "z", 1.5),"foo")
    item2 = ((2, "a", 40.0), "bar")
    item3 = ((3, "f", 0.5), "barfoo")

    a.add(item1)
    a.add(item3)
    a.add(item2)
    print("Example of sorted list")
    print(a)
    a.remove(item3)
    print("The same list without the barfoo")
    print(a)


    b = MultipleKeysObjectContainer({"id": (lambda x,y : x[0][0] > y[0][0]), "letter": (lambda x,y : x[0][1] > y[0][1] ), "value":(lambda x,y : x[0][2] > y[0][2])})

    b.add(item1)
    b.add(item3)
    b.add(item2)
    print("A multiple key container, object are ordered according three criterion.")
    print(b)
    print("Remove the first item if items are ordered by letter", b.popOrderedBy("letter", 0))
    print("After this removing the container contains :")
    print(b)

This results in :

Example of sorted list
[((1, 'z', 1.5), 'foo'), ((2, 'a', 40.0), 'bar'), ((3, 'f', 0.5), 'barfoo')]
The same list without the barfoo
[((1, 'z', 1.5), 'foo'), ((2, 'a', 40.0), 'bar')]
A multiple key container, object are ordered according three criterion.
MultipleKeysObjectContainer{
    Order by : id{
        ((1, 'z', 1.5), 'foo')
        ((2, 'a', 40.0), 'bar')
        ((3, 'f', 0.5), 'barfoo')
    }
    Order by : value{
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
        ((2, 'a', 40.0), 'bar')
    }
    Order by : letter{
        ((2, 'a', 40.0), 'bar')
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
    }
}
Remove the first item if items are ordered by letter ((2, 'a', 40.0), 'bar')
After this removing the container contains :
MultipleKeysObjectContainer{
    Order by : id{
        ((1, 'z', 1.5), 'foo')
        ((3, 'f', 0.5), 'barfoo')
    }
    Order by : value{
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
    }
    Order by : letter{
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
    }
}

That look like you are looking for ( almost, just have to add binary search :p ) Good luck!

Collectives™ on Stack Overflow

Efficient data structure keeping objects sorted on multiple keys

2 Answers 2

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related