2

I'm trying to create a more efficient way to sort lists and dictionaries in python and came across Efficient data structure keeping objects sorted on multiple keys. There the suggested solution was to use the heapq module.

However, in my tests a heap seems to be twice as slow as the native Python sorting algorithm. Below is the code I used to do a simple test. Results were for example:

Heap: 0.005993366241455078

Standard: 0.0020036697387695312

Is there a way to actually use a heap and increase the performance, as the post linked above claims? What would that code look like?

Here's the code to test it:

import  random
import time
from heapq import *

standardlist = []
heaplist = []
for i in range(10000):
    num = random.randint(0,10000)
    standardlist.append(num)
    heappush(heaplist, num)

# Standard sorting method:
start_time = time.time()
sorted_list = sorted(standardlist)
finish_time_1 = time.time() - start_time

# Heap sorting method:
start_time = time.time()
heap_sorted_list = [heappop(heaplist) for i in range(len(heaplist))]
finish_time_2 = time.time() - start_time

print("Standard Finish Time:", finish_time_1)
print("Heap Finish Time:", finish_time_2)
7
  • 4
    Please note that the question you linked is different from your case in that the OP is asking for sorting on multiple keys, and the task is referred to be add/delete heavy. None of these conditions seem to reflect your case. Find here a good discussion about heaps that might help you in understanding whether they are the right class for the case at your hand. Commented Feb 23, 2024 at 13:09
  • 5
    Python's TimSort algorithm is extremely efficient, you're wasting your time trying to beat it. Commented Feb 23, 2024 at 13:16
  • 2
    Heaps are not used for fast sorting, primarily they are used for fast access to max/min value (in O(1) time). Commented Feb 23, 2024 at 13:17
  • 1
    Inserting items in a heap is cheap (though not so much on binary heaps which are certainly used here). Deleting the min/max item from a heap is the expensive part (O(log n) time). If you delete all items, then it will take O(n log n) time. Python sort use a TimSort which also runs in O(n log n) but it is implemented in C, very optimized and TimSort is known to be usually faster than HeapSort. There is no chance you can be faster than sort with this code. In fact, with any general-purpose sort written in pure-Python. Commented Feb 23, 2024 at 13:22
  • 1
    Heap sort only worth it if you want to extract the k min/max items and k<<n. In fact, there are faster algorithm in this case too (partitioning can be done in O(n) and HeapSort takes O(n + k log n) not to mention the higher constant than optimized partitioning algorithms). Commented Feb 23, 2024 at 13:26

1 Answer 1

6

A heap data structure may be the right solution when you have a collection that changes over time (through insertions and deletions), and at each moment you want to have fast access to the minimum entry that is currently in the collection, and possibly extract it. There were such requirements in the Q&A you linked to.

If however the goal is just to sort a dataset once, then using a heap is not the most efficient.

Some remarks on your code:

  • The way you have populated the heap has O(𝑛log𝑛) time complexity. It is more efficient to first populate the list, and then call heapify on it: this has a time complexity of O(𝑛). Admittedly, this is not relevant for the timings you performed, but it also results in less code:

    standardlist = [random.randint(0,100000) for _ in range(1000000)]
    heaplist = standardlist[:]
    heapify(heaplist)
    
  • sorted is a native Python function, which can rely on compiled C code for the sort part. An explicit loop written in Python cannot beat that.

  • Although heapsort has an optimal time complexity, a well-implemented quicksort is usually 2–3 times faster than heapsort. See also Quicksort vs heapsort. Python actually uses a highly optimised sort algorithm.

  • Your code only performs one test, and it completes in a few milliseconds. This does not give very representative results. It would be better to repeat the test multiple times and get an average.

Sign up to request clarification or add additional context in comments.

7 Comments

nsmallest is implemented in pure Python, but the source talks about the complexity of the algorithm used. Loosely speaking, nsmallest can be faster because it doesn't necessarily need to maintain the heap property while extracting small values, unlike heappop.
@chepner, I mentioned "possibly through compiled C code", based on the info provided in Python heapq - Python and C implementation? Which one is used?
Ah, I missed that because it's waaaay down at the bottom of heapq.py. (I should have known better; I was aware that operator does the same thing.)
I did implement the changes you suggested: nsmallest(len(heaplist), heaplist) to sort the heap list or apply native sorted(heaplist) on the list on which I ran heapify before. Both are slower than native sorted (on multiple tests of course). I assume heap is not the right answer in my case to quickly and constantly update and sort lists.
This is expected as I explained in my answer. There are many situations where you constantly update a collection and still need the collection to remain sorted (maybe not completely?). Depending on the needs there are solutions. You may look into B-trees, AVL trees, red-black trees, skip lists, ...etc. You could benefit from Python Sorted Containers.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.