
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However, the code I've written takes far too long to run, even with just 50,000 records.

This is the code with which I'm trying to populate the database:

import os

def populate():   
    with open("filename") as f:
        for line in f:
            col = line.strip().split("|")
            duns=col[1]
            name=col[8]
            job=col[12]        

            dun_add = add_c_duns(duns)   
            add_contact(c_duns = dun_add, fn=name, job=job)

def add_contact(c_duns, fn, job):
    c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
    return c

def add_c_duns(duns):
    cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
    return cd  

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate()
    print "Done!!"

The code works correctly; I have tested it with dummy records and it gives the desired results. I would like to know if there is a way to lower the execution time of this code. Thanks.

1 Comment
    ORM's are rarely any good at bulk operations (in any language). They need to create objects in memory, track state, etc... They also tend to use connections inefficiently and not to optimise/batch multiple operations. To make things worse, you're asking the ORM to check if the object exists before picking an operation (INSERT/UPDATE) which is even more overhead per record. Depending on the database engine you're using, look at the appropriate bulk import tools or (at the very least) generating raw SQL statements. Commented Oct 3, 2015 at 22:04
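To illustrate the comment's point about generating raw SQL instead of going through the ORM, here is a minimal sketch using Python's `sqlite3` module directly. The table name `web_contact_duns` is a guess at what Django would generate for the `Contact_DUNS` model; adjust it to your actual schema.

```python
import sqlite3

def bulk_insert_duns(conn, duns_values, batch_size=10000):
    # Table/column names are a guess at what Django generates for the
    # Contact_DUNS model; adjust them to your schema.
    cur = conn.cursor()
    for i in range(0, len(duns_values), batch_size):
        # executemany sends one prepared statement with many parameter
        # sets, avoiding a Python/SQL round trip per row
        cur.executemany(
            "INSERT INTO web_contact_duns (duns) VALUES (?)",
            ((d,) for d in duns_values[i:i + batch_size]),
        )
    conn.commit()  # a single commit keeps the import in one transaction
```

Keeping all inserts inside one transaction matters as much as the batching: SQLite pays a sync cost per commit, so committing once per row is what usually dominates the runtime.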

3 Answers


I don't have enough reputation to comment, but here's a speculative answer.

Basically, the only way to do this efficiently through Django's ORM is to use bulk_create. So the first thing to consider is your use of get_or_create. If the database already contains records that might duplicate entries in the input file, then your only choice is writing the SQL yourself. If you are using it only to avoid duplicates within the input file, then preprocess the file to remove duplicate rows instead.
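If the duplicates live only in the input file, the preprocessing pass can be a short script like the following sketch (the column position and `|` separator are taken from the question's code), which keeps the first occurrence of each DUNS number:

```python
def dedupe_file(in_path, out_path, key_index=1, sep="|"):
    # Keep only the first line seen for each DUNS number
    seen = set()
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            key = line.strip().split(sep)[key_index]
            if key not in seen:
                seen.add(key)
                fout.write(line)
```

A set of 6 million short strings fits comfortably in memory, so a single pass is enough.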

So if you can live without the get part of get_or_create, then you can follow this strategy:

  1. Go through each row of the input file and instantiate a Contact_DUNS object for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns)) and append each instance to a list. Pass the list to bulk_create to actually create the rows.

  2. Generate a list of DUNS-id pairs with values_list and convert them to a dict with the DUNS number as the key and the row id as the value.

  3. Repeat step 1, but with Contact instances. Before creating each instance, use the DUNS number to get the Contact_DUNS id from the dictionary of step 2. Then instantiate each Contact in the following way: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances, just pass them to bulk_create to create the rows.

This should radically improve performance, as you will no longer be executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or the input file.

EDIT Here's the code:

import os

def populate_duns():
    # Will only work if there are no DUNS duplicates
    # (both in the DB and within the file)
    duns_instances = []   
    with open("filename") as f:
        for line in f:
            duns = line.strip().split("|")[1]        
            duns_instances.append(Contact_DUNS(duns=duns))

    # Run a single INSERT query for all DUNS instances
    # (it will actually run in batches, but it's still quite fast)
    Contact_DUNS.objects.bulk_create(duns_instances)

def get_duns_dict():
    # This is basically a SELECT query for these two fields
    duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
    return dict(duns_id_pairs)

def populate_contacts():
    # Repeat the same process for Contacts
    contact_instances = []
    duns_dict = get_duns_dict()

    with open("filename") as f:
        for line in f:  
            col = line.strip().split("|")
            duns = col[1]
            name = col[8]
            job = col[12]

            ci = Contact(duns_id=duns_dict[duns],
                         fullName=name,
                         title=job)
            contact_instances.append(ci)

    # Again, run only a single INSERT query
    Contact.objects.bulk_create(contact_instances)

if __name__ == '__main__':
    print "Populating Contact db...."
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
    from web.models import Contact, Contact_DUNS
    populate_duns()
    populate_contacts()
    print "Done!!"

1 Comment

I'm experimenting with bulk_create() and raw() at the moment and still working on the code; if you could help me out with bulk_create, I'd appreciate it!

CSV Import

First of all, 6 million records is quite a lot for SQLite, and worse still, SQLite isn't very good at importing CSV data directly.

There is no standard as to what a CSV file should look like, and the SQLite shell does not even attempt to handle all the intricacies of interpreting a CSV file. If you need to import a complex CSV file and the SQLite shell doesn't handle it, you may want to try a different front end, such as SQLite Database Browser.

On the other hand, MySQL and PostgreSQL are more capable of handling CSV data: MySQL's LOAD DATA INFILE and PostgreSQL's COPY are both painless ways to import very large amounts of data in a very short period of time.
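Both tools want a clean delimited file, so the pipe-delimited input would first need its relevant columns extracted. A small preprocessing sketch (column positions are taken from the question; the output layout is an assumption):

```python
import csv

def extract_columns(in_path, out_path, indices=(1, 8, 12), sep="|"):
    # Write only the needed columns as standard CSV, which both
    # PostgreSQL's COPY and MySQL's LOAD DATA INFILE can ingest
    with open(in_path) as fin, open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for line in fin:
            col = line.strip().split(sep)
            writer.writerow([col[i] for i in indices])
```

On the PostgreSQL side the load is then a single statement along the lines of `COPY contacts (duns, full_name, title) FROM '/path/out.csv' WITH (FORMAT csv);` (table and column names here are illustrative).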

Suitability of SQLite

You are using Django => you are building a web app => more than one user will access the database. This is from the SQLite documentation about concurrency:

SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.

Even your read operations are likely to be rather slow, because an SQLite database is just one single file, so with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks, as is possible with proper client-server databases.

The good news for you is that with Django you can usually switch from Sqlite to Mysql to Postgresql just by changing your settings.py. No other changes are needed. (The reverse isn't always true)

So I urge you to consider switching to MySQL or PostgreSQL before you get in too deep. It will help you solve your present problem and also help you avoid problems that you will run into sooner or later.



6,000,000 records is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that imports the CSV data directly and creates your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
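Even staying in Python, the bottleneck is the ORM rather than the language: the standard sqlite3 module can load the file directly, skipping Django entirely. A sketch under the question's column layout (the table and column names are illustrative):

```python
import sqlite3

def import_file(db_path, data_path):
    # Bypass Django entirely: parse the pipe-delimited file and insert
    # with plain sqlite3. Table/column names are illustrative.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (duns TEXT, full_name TEXT, title TEXT)"
    )
    with open(data_path) as f:
        conn.executemany(
            "INSERT INTO contacts VALUES (?, ?, ?)",
            (
                (col[1], col[8], col[12])
                for col in (line.strip().split("|") for line in f)
            ),
        )
    conn.commit()
    conn.close()
```

Because executemany streams the generator, the whole file never needs to be held in memory at once.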

I once imported 20,000,000 records using the sqlite3 CSV importer and it took only a few minutes.

