5

I'm currently working with some very large lists (50 to 100 million entries) of information where each item in the list is in the form of [float,(string_1,string_2)]

I'm adding elements to the list in an unsorted manner, and would eventually like to have a list that is sorted by the float value. For example I would have a list that looks like this:

[ [0.5,(A,B)], [-0.15,(B,C)], [0.3,(A,C)], [-0.8,(A,D)] ]

and then sort it to get

[ [0.5,(A,B)], [0.3,(A,C)], [-0.15,(B,C)], [-0.8,(A,D)] ]

Currently I'm using heapq to add items as I go along and then using sorted(heap) to ultimately give me the list I need. My question is: is there a better way to go about adding millions of items to a list and sorting them that won't crash my computer? Holding a list that long and then sorting it is causing some issues with my RAM.

6
  • With that many entries, you should really consider writing your data to a file as it is collected. Commented Jun 7, 2017 at 21:47
  • Why not using generators ? Instead of storing your data into a list then sorting it, create a generator then sort it once. Commented Jun 7, 2017 at 21:51
  • 1
    @ChihebNexus: A generator will still be converted to a list in memory to sort it. Commented Jun 8, 2017 at 0:52
  • @ShadowRanger True, but you'll escape storing the whole list in memory then sorting it again. Commented Jun 8, 2017 at 1:41
  • 3
    @ChihebNexus: You won't. If you just make the list up front (without pointless use of heapq), then call .sort() it's exactly the same. It saves if you're calling sorted(mygenerator) vs. sorted(mylist), but if you already have mylist, calling mylist.sort() avoids the memory overhead too. Commented Jun 8, 2017 at 13:56

1 Answer 1

7

sorted() creates an entirely distinct list, so doubles the RAM needed for the massive list. Use a list's .sort() method instead - that sorts the list in-place.

And unless there's something you haven't told us, leave heapq out of it entirely. Putting the entries in a heap serves no purpose I can think of. Just use a list's .append() method to add new entries, and apply .sort(reverse=True) to the list at end.

If you're still running out of RAM, then you simply can't solve this problem entirely in memory, and will need to craft an approach merging disk files.

LIVING WITH "TOO SMALL" RAM

In the worst case, even the list all by itself can't fit in available memory. You can still create the sorted sequence, but it requires writing sorted chunks to disk and merging them later. For the merging part, heapq is useful. Here's an example:

import pickle
import heapq

MAXPERFILE = 100  # the array will never get bigger than this

def getfname(i):
    return "pickled%d.dat" % i

filenum = 0
def dumptofile(a):  # dump the array to file, as pickled data
    global filenum
    fname = getfname(filenum)
    with open(fname, "wb") as f:
        pickle.dump(len(a), f)
        for x in a:
            pickle.dump(x, f)
    filenum += 1

# generate some random data
import random
a = []
for _ in range(1012):  # 10 "full" files with some leftovers
    a.append(random.random())
    if len(a) == MAXPERFILE:
        a.sort(reverse=True)
        dumptofile(a)
        del a[:]
if a:
    a.sort(reverse=True)
    dumptofile(a)

print("number of files written:", filenum)

# now merge the files together; first a function
# to generate the file contents, one at a time
def feedfile(i):
    fname = getfname(i)
    with open(fname, "rb") as f:
        count = pickle.load(f)
        for _ in range(count):
            yield pickle.load(f)

for x in heapq.merge(*(feedfile(i) for i in range(filenum)),
                     reverse=True):
    print(x)

Max memory use can be made smaller by decreasing MAXPERFILE, although performance will be better the larger MAXPERFILE. Indeed, if MAXPERFILE is small enough and the total amount of data is large enough, the merging code may die with an OS "too many open files" error.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.