
Recently I came across the following issue: how can you iterate over a really big query result in order to perform actions (say, for every object, create two different objects)? If you are handling a small queryset, this is simple:

for obj in Mymodel.objects.all():
    create_corresponding_entries(obj)

Now try doing this on a queryset with 900k objects. Your PC will probably freeze, because the query will eat up all available memory. So how can I achieve this lazily? The same question applies whether you use the Django ORM or SQLAlchemy.

4 Answers


I do not know if I misunderstood your question, or if the other answers predate current versions of Django, but for Django see: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator

for i in Mymodel.objects.iterator(chunk_size=2000):
    print(i)

As explained in the docs, on some databases this is implemented with server-side cursors on the RDBMS; on others, with chunked-fetching tricks.
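To tie this back to the original task (creating entries for each object) without holding everything in memory, the streamed iterator can be combined with consumer-side batching. A minimal, framework-free sketch; the Django calls in the trailing comment reuse the hypothetical names from the question:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterator."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical Django usage, with the names from the question
# (Mymodel, create_corresponding_entries) and bulk_create to
# write each batch back in a single query:
#
# for batch in chunked(Mymodel.objects.iterator(chunk_size=2000), 2000):
#     Othermodel.objects.bulk_create(
#         [entry for obj in batch for entry in create_corresponding_entries(obj)]
#     )
```

This keeps at most one batch of rows in memory at a time on the consumer side, regardless of the total result size.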


1 Comment

The accepted answer by @JohnParaskevopoulos is appropriate if you want a queryset of model instances, fetched from the database in batches of that size on each iteration of the for loop. Your answer is appropriate if you want to process the model instances one by one (getting one instance per iteration of the for loop) but still fetch them from the database in larger batches. The two are similar, and with a small amount of additional code can achieve essentially the same behavior.

Although the Django ORM gives you a "lazy" QuerySet, what I was looking for was a generator that would let me fetch my objects lazily. QuerySets in Django are not really lazy: they are lazy only until you try to access them, at which point the database is hit and all 1M entries are fetched. SQLAlchemy behaves the same way. If you have an Oracle or PostgreSQL database you are lucky and can use the supported server-side cursors. SQLAlchemy also supports these, plus MySQL if you use the mysqldb or pymysql dialects. I'm not sure how server-side cursors work behind the scenes.
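For SQLAlchemy, the streaming behavior alluded to here is exposed via `Query.yield_per` (and, at the Core level, `execution_options(stream_results=True)`). A small runnable sketch against an in-memory SQLite database; `Item` is a made-up model for the demo, and the server-side-cursor benefit only materializes on databases/drivers that actually support it:

```python
from sqlalchemy import create_engine, Column, Integer
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Item(Base):  # hypothetical model, just for this demo
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")  # in-memory here; server-side cursors
Base.metadata.create_all(engine)     # need e.g. PostgreSQL or Oracle

with Session(engine) as session:
    session.add_all([Item() for _ in range(10)])
    session.commit()
    # yield_per(n) materializes results in batches of n instead of loading
    # the whole result set at once; with a streaming driver this maps to a
    # server-side cursor under the hood.
    total = sum(1 for _ in session.query(Item).yield_per(3))
```

The iteration still sees every row; only the buffering strategy changes.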


So if you don't fit any of the above cases, you have to figure out a way to fetch these objects lazily. Because both the Django ORM and SQLAlchemy support slicing by translating it to pure SQL queries, I figured I could use a custom generator to slice out the batches of the queries I needed.

Disclaimer: This solution addresses the problem of dumping a lot of data locally; it doesn't try to maximize query performance or anything else performance-related on the database side.

Warning: This will result in more queries to the database than a simple Mymodel.objects.all(), but it will challenge your RAM much less.

def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    counter = start
    while counter < max_count:
        yield fetch_func()[counter:counter + max_obj]
        counter += max_obj

and then use it, for example:

fetcher = lazy_bulk_fetch(50, Mymodel.objects.count(), lambda: Mymodel.objects.order_by('id'))
for batch in fetcher:
    make_actions(batch)

This fetches, on each iteration, a list of 50 objects until I reach the maximum count I want. If you replace make_actions(batch) with print(batch.query), in Django you'll see something like the following:

SELECT "services_service"."id" FROM "services_service" LIMIT 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 50
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 100
SELECT "services_service"."id" FROM "services_service" LIMIT 50 OFFSET 150

The same concept can be used with the slice method that SQLAlchemy supports. The solution in that case would be the same, but instead of Python slicing you would use the slice function of the SQLAlchemy Query object.

EDIT: From what I saw, the SQLAlchemy Query class implements the __getitem__ function, so for SQLAlchemy you can use the exact same function I suggested for Django. If you want to explicitly use the slice function, you would end up with something like the following:

def lazy_bulk_fetch(max_obj, max_count, fetch_func, start=0):
    counter = start
    while counter < max_count:
        yield fetch_func().slice(counter, counter + max_obj)
        counter += max_obj

In either case you would call it like this:

from sqlalchemy import func
fetcher = lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                          lambda: session.query(Mymodel).order_by(Mymodel.id))

Two notes here:

  1. You want to use func.count in order for this to be translated to a COUNT SQL statement on the server (call .scalar() on that query to actually get the number). If you used len(session.query(Mymodel).all()) you would dump everything locally, find its length, and then throw it away.
  2. I use the lambda so that the implementation matches the Django one. I could also have had

    lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)).scalar(),
                    session.query(Mymodel).order_by(Mymodel.id))
    

    but then I would have to write, in my function,

    yield fetch_func.slice(counter, counter + max_obj)
    

EDIT #2: I added ordering, since otherwise you cannot be sure that you won't get the same results on the Nth run. Ordering guarantees that you will get unique results across batches. It's best to use the id as the ordering key; otherwise you cannot be sure you won't miss a result (during the Nth hit, a new entry might have been added, and ordering without the id could make you miss it or get duplicate entries).
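An alternative worth noting (not part of this answer): instead of ordering plus OFFSET, you can page on the id itself, which avoids re-scanning skipped rows and stays correct even when rows are inserted between batches. A framework-free sketch of this technique, often called keyset pagination; the fetch_page interface is hypothetical, with the Django translation hedged in the docstring:

```python
from collections import namedtuple

def keyset_fetch(fetch_page, batch_size=50):
    """Yield batches by remembering the last seen id instead of using OFFSET.

    `fetch_page(last_id, limit)` is a hypothetical interface: it must return
    at most `limit` objects with id > last_id, ordered by id. With Django it
    could be implemented as
        Mymodel.objects.filter(id__gt=last_id).order_by('id')[:limit]
    """
    last_id = 0  # assumes positive integer ids
    while True:
        batch = fetch_page(last_id, batch_size)
        if not batch:
            return
        yield batch
        last_id = batch[-1].id

# In-memory demonstration with a stand-in row type:
Row = namedtuple("Row", "id")
rows = [Row(i) for i in range(1, 8)]

def fetch_page(last_id, limit):
    return [r for r in rows if r.id > last_id][:limit]

batches = list(keyset_fetch(fetch_page, batch_size=3))
```

Each query filters by the last id seen instead of skipping an ever-growing OFFSET, so an index on id keeps every batch equally cheap.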

4 Comments

Note that this method does not guarantee that every object in the collection will be fetched exactly once if your max_obj is smaller than the total number of entries. This is because this slicing method does not guarantee the results to be ordered. What this means is that if your batch size is a lot smaller than the total object count, there is a big chance that you will retrieve the same object multiple times and some objects will never be retrieved. This is probably not the behavior that one wants!
You are right: you need to wrap it in an order_by to make sure you get unique results (the offset then works based on the ordering). For example, in Django, instead of lazy_bulk_fetch(50, Mymodel.objects.count(), Mymodel.objects.all) you should use lazy_bulk_fetch(50, Mymodel.objects.count(), lambda: Mymodel.objects.order_by('id')), and in SQLAlchemy lazy_bulk_fetch(50, session.query(func.count(Mymodel.id)), lambda: session.query(Mymodel).order_by(Mymodel.id)). Am I missing something, or should I update the answer?
that would indeed fix the problem, although I am not sure if this is the most efficient solution. For sure one should probably make sure that for big databases there is an index on the sorting column. But at least this solution should now be correct.
As I said at the beginning of the answer, this solution is certainly not efficient db-wise, but I found it quite efficient backend-wise (until a more efficient solution is found). Even though it's an edge case (fetching that many entries is not something one usually does; people use filters and fetch less), I found this to be the only way to avoid putting so much pressure on the backend's memory. Thanks for the remark.

If you offload the processing to the database (via the Django ORM), the entire operation can be done in 3 database calls:

  1. Call values_list to obtain a list of all primary keys. With 900K keys of 64 bytes each, it should still take only around 56 MB of memory, which should not put your system under undue stress.
model_ids = MyModel.objects.values_list('id', flat=True)
  2. Now, decide how many entries you wish to load at a time. If you call in_bulk with a subset of values_list, you can handle this in chunks your system is comfortable with. To process all entries at once, set CHUNK_SIZE to len(model_ids). (The "3 database calls" claim holds only if you call in_bulk with CHUNK_SIZE >= len(model_ids). Memory load will depend on how big MyModel is, and CPU load should be minimal.)
for counter in range(0, len(model_ids), CHUNK_SIZE):
    chunk = MyModel.objects.in_bulk(model_ids[counter:counter + CHUNK_SIZE])
    # Do whatever you wish with this chunk, e.g. create the objects in place.
  3. The last part is where you create the other objects. This is an ideal place to use bulk_create, which makes the entire process much more efficient. Even if you do not use in_bulk and values_list, bulk_create gives you a significant advantage if you are creating anything over 2-3 objects. Combined with the code from step 2, you could do something like this:
objs_to_create = []
for counter in range(0, len(model_ids), CHUNK_SIZE):
    chunk = MyModel.objects.in_bulk(model_ids[counter:counter + CHUNK_SIZE])
    # Populate the object(s), either directly or in a loop, but using the
    # MyModel constructor, not an ORM query. That is, use
    #     m = MyModel(..)
    # instead of
    #     m = MyModel.objects.create(..)
    # Append each created MyModel python object to objs_to_create. Note that
    # we have not created these objects in the database yet.
    # ...
    # Now create these objects in the database using a single call
    MyModel.objects.bulk_create(objs_to_create)
    # Rinse and repeat
    objs_to_create = []

No more CPU hangs, and you can fine-tune memory usage to your heart's content.
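The chunk arithmetic in steps 2-3 can be checked in isolation; this sketch substitutes a plain Python list for the id queryset, so no Django is involved:

```python
# Stand-in for values_list('id', flat=True): 900 primary keys.
CHUNK_SIZE = 300
model_ids = list(range(1, 901))

chunks = []
for counter in range(0, len(model_ids), CHUNK_SIZE):
    # In the real flow each slice would be passed to MyModel.objects.in_bulk.
    chunks.append(model_ids[counter:counter + CHUNK_SIZE])
# 900 ids / 300 per chunk -> 3 in_bulk-style calls
```

The range step guarantees the slices tile the id list exactly once, with a possibly shorter final chunk.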



Based on @John Paraskevopoulos's answer for Django ORM, I adapted it a bit to make it perhaps a little more general:

from typing import Callable

def bulkFetch(Cls, batchSize: int = 100, start: int = 0, end: int = None, fetchFunc: Callable = None):
    '''
    Query Django model instances and retrieve the instances lazily in batches.
    Params:
    - Cls: the Django model class
    - batchSize: number of instances to yield each iteration
    - start: index in the queryset to start yielding from
    - end: index in the queryset to stop at (exclusive)
    - fetchFunc: a function returning the queryset to slice. By default set to None: all model instances of the given class are retrieved, ordered by pk.
    '''
    counter = start
    maxCount = Cls.objects.count()

    if end is not None and end < maxCount:
        maxCount = end

    def defaultFetchFunc():
        qs = Cls.objects.order_by('pk')
        if end is None:
            return qs
        else:
            return qs[:end]

    if fetchFunc is None:
        fetchFunc = defaultFetchFunc

    while counter < maxCount:
        yield fetchFunc()[counter:counter+batchSize]
        counter += batchSize

