0

I have a moderately large list of emails, about 250,000 entries.

I have another table containing a list of about 50,000 invalid emails, which I need to remove (mark inactive) from the first table. For that I ran a simple Django function, which takes 3-4 seconds per loop iteration. The code is:

def clean_list():
    id = 9
    while id<40000:
        i = Invalid.objects.get(id=id)
        y = i.email.strip()
        f = IndiList.objects.get(email__contains=y)
        f.active = False
        f.save()
        id +=1

What would be a better way to do it? Either a SQL query, a better piece of Django code, or some other way.

Help!

3 Answers

1

Untested:

IndiList.objects.filter(email__in=Invalid.objects.values('email')).update(active=False)

I am not sure whether Django is smart enough to build a subquery from that. If not, this should do:

IndiList.objects.filter(email__in=Invalid.objects.all().values_list('email', flat=True)).update(active=False)

The problem with the second approach arises if the inner queryset is evaluated first (for example, wrapped in list()): it will then generate 2 queries instead of one and inject 50,000 emails into the second SQL query string. I would much rather just use raw SQL at this point:

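from django.db import connection

# Table names are as written here; Django's defaults are usually <app>_indilist and <app>_invalid.
with connection.cursor() as cursor:
    cursor.execute('UPDATE indilist SET active=false WHERE email IN (SELECT email FROM invalid)')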
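
To see what a single UPDATE with a subquery does, here is a minimal, self-contained sketch that uses Python's built-in sqlite3 module rather than Django. The table and column names (indilist, invalid, email, active) are assumed to mirror the models in the question, and the sample rows are made up. The TRIM call is also an assumption, standing in for the .strip() in the question's loop.

```python
import sqlite3


def deactivate_invalid(conn):
    """Mark every indilist row inactive whose email appears in invalid."""
    cur = conn.execute(
        "UPDATE indilist SET active = 0 "
        "WHERE email IN (SELECT TRIM(email) FROM invalid)"
    )
    conn.commit()
    return cur.rowcount  # number of rows the UPDATE modified


# Hypothetical in-memory tables standing in for the two Django models.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indilist (id INTEGER PRIMARY KEY, email TEXT, active INTEGER)"
)
conn.execute("CREATE TABLE invalid (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO indilist (email, active) VALUES (?, 1)",
    [("a@example.com",), ("b@example.com",), ("c@example.com",)],
)
conn.executemany(
    "INSERT INTO invalid (email) VALUES (?)",
    [(" b@example.com ",), ("zzz@example.com",)],
)

updated = deactivate_invalid(conn)
still_active = [
    row[0]
    for row in conn.execute(
        "SELECT email FROM indilist WHERE active = 1 ORDER BY id"
    )
]
```

Note that this matches emails exactly, whereas the question's loop used email__contains (a substring match), so the two are only equivalent if the stored emails are clean.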

1 Comment

Exactly, Raw SQL is way faster.
1

There are a couple of optimisations you might want to look at. Instead of looping over a get for each object, try getting a values list:

queryset = Invalid.objects.filter(id__range=(9, 40000))
queryset_list = queryset.values_list('email', flat=True)

https://docs.djangoproject.com/en/1.10/ref/models/querysets/#values-list

Then loop over the values list and do a .get() on each email. At the end you can also do:

f.active = False
f.save(update_fields=['active'])

Which will only update the boolean field. https://docs.djangoproject.com/en/1.10/ref/models/instances/#updating-attributes-based-on-existing-fields

Also, try to find a way to .get() the object via id or some field other than a string, if possible.
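
A variation on these suggestions, not something stated above, is to skip the per-row .get() and update one batch of emails at a time. The sketch below is a rough illustration under that assumption. The helper names chunked and deactivate_in_batches are hypothetical, and the database call is passed in as a callable so the example runs without Django.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def deactivate_in_batches(emails, update_batch, batch_size=1000):
    """Call update_batch once per chunk of stripped emails; return the sum of its results."""
    total = 0
    cleaned = (email.strip() for email in emails)
    for batch in chunked(cleaned, batch_size):
        total += update_batch(batch)
    return total


# In a Django project the inputs could look like this (assumed, untested):
#   emails = Invalid.objects.filter(id__range=(9, 40000)).values_list('email', flat=True)
#   update_batch = lambda batch: IndiList.objects.filter(email__in=batch).update(active=False)

# Stand-in for the database so the sketch runs on its own.
seen_batches = []


def fake_update(batch):
    seen_batches.append(batch)
    return len(batch)


total = deactivate_in_batches(
    [" a@x.com", "b@x.com ", "c@x.com"], fake_update, batch_size=2
)
```

With a batch size of 1000, the 50,000 invalid emails would cost roughly 50 UPDATE statements instead of one SELECT and one save per email.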

0

After a few iterations, I used this function, which was more than 1000 times faster.

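def clean_list3():
    # Load the invalid emails once; a set makes each membership test O(1).
    pp = Invalid.objects.filter(id__gte=9)
    invalid_emails = {oo.email.strip() for oo in pp}
    for e in IndiList.objects.all():
        if e.email.strip() in invalid_emails:
            e.active = False
            e.save()  # only matching rows are written back
            print(e.id)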

The trick is simple: instead of hitting the database every time, I loaded the 250,000 objects into memory as a queryset, along with the list of emails from the invalid list.

Then I had to hit the database only when a matching email was found, to save it as inactive.
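
The in-memory matching can be pushed a little further by collecting the matching ids and writing them back in one go. This is only a sketch on plain tuples, not the code that was actually run. The function name find_ids_to_deactivate and the sample rows are hypothetical, and the membership test uses a set so each lookup is constant time.

```python
def find_ids_to_deactivate(rows, invalid_emails):
    """Return the ids of rows whose stripped email is in invalid_emails.

    rows: iterable of (id, email) pairs
    invalid_emails: iterable of email strings
    """
    invalid = {email.strip() for email in invalid_emails}  # set: O(1) lookups
    return [row_id for row_id, email in rows if email.strip() in invalid]


# Made-up sample data standing in for the two tables.
rows = [(1, "a@example.com"), (2, " b@example.com"), (3, "c@example.com")]
ids = find_ids_to_deactivate(rows, ["b@example.com ", "c@example.com"])

# In Django the pairs could come from IndiList.objects.values_list('id', 'email'),
# and the write could be a single statement (assumed, untested):
#   IndiList.objects.filter(id__in=ids).update(active=False)
```

A very long id list may need to be split into chunks on some databases, since there can be a limit on the number of query parameters.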
