
I am using Python 2.7 and SQLite, and I am building a database with millions of rows. I would like to write out to disk only occasionally, with the idea that this will improve performance. My thought was to only call commit() from time to time. I have tried that with the code below. The selects in the middle show that we get consistent reads. But when I look on disk, I see a file example.db-journal. This must be where the data is being cached, in which case it would gain me nothing in terms of performance. Is there a way to have the inserts collect in memory and then flush them to disk? Is there a better way to do this?

Here is my sample code:

import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')

c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get 2 rows as expected.
print c.fetchall()
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

conn.commit()

t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get all the rows as expected.
print c.fetchall()

conn.close()

Update:

Figured I would give an update with some code in case anyone runs across this problem. I am processing 5+ million lines from a text file and needed a place to store the data for further processing. I originally had all the data in memory but was running out of memory, so I switched to SQLite as a disk cache. My original in-memory version of the processing took ~36 seconds per 50,000 rows from the original text file.

After measuring, my first cut at an SQLite version of the batch processing took ~660 seconds for 50,000 lines. Based on the comments (thanks to the posters), I came up with the following code:

self.conn = sqlite3.connect('myDB.db', isolation_level='EXCLUSIVE')
self.cursor = self.conn.cursor()
self.cursor.execute('PRAGMA synchronous = 0')     # don't fsync after every write
self.cursor.execute('PRAGMA journal_mode = OFF')  # skip the rollback journal entirely

In addition, I commit after processing 1000 lines from my text file.

if lineNum % 1000 == 0:
    self.conn.commit()

With that, 50,000 lines from the text file now take ~40 seconds. So I added about 11% to the overall time, but memory use is constant, which is more important.
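In case it helps anyone, here is roughly how the pieces fit together in one place. This is only a sketch: the file name, the table layout, and the tab-separated parsing are placeholders for illustration, not my actual processing code.

import sqlite3

conn = sqlite3.connect('myDB.db', isolation_level='EXCLUSIVE')
cursor = conn.cursor()
cursor.execute('PRAGMA synchronous = 0')     # don't fsync after every write
cursor.execute('PRAGMA journal_mode = OFF')  # no rollback journal (unsafe on crash)
cursor.execute('CREATE TABLE IF NOT EXISTS lines (col1 text, col2 real)')

with open('input.txt') as f:
    for lineNum, line in enumerate(f, 1):
        col1, col2 = line.rstrip('\n').split('\t')   # placeholder parsing
        cursor.execute('INSERT INTO lines VALUES (?, ?)', (col1, float(col2)))
        if lineNum % 1000 == 0:
            conn.commit()   # flush a batch of 1000 rows

conn.commit()   # commit whatever is left over
conn.close()

Note that synchronous = 0 and journal_mode = OFF trade crash safety for speed: if the process dies mid-run, the database file can be left corrupted. That is acceptable here only because the database is a throwaway scratch cache.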

2 Comments

  • You might use an in-memory store like Redis instead.
  • If you really need this (measure!), you can try using an in-memory SQLite DB (":memory:") and dump its contents periodically to an on-disk DB.

2 Answers


Firstly, are you sure you need this? For reading, the OS should cache the file anyway, and if you write a lot, not syncing to disc means you can lose data easily.

If you measure and identify this as a bottleneck, you can use an in-memory database via connect(':memory:') and get an iterator that returns an SQL dump on demand: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.iterdump

import sqlite3

# Build everything in the in-memory database first...
in_memory = sqlite3.connect(':memory:')
# do stuff

# ...then replay its SQL dump into the on-disk database.
# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT statements emitted by iterdump() drive the transaction.
con = sqlite3.connect('existing_db.db', isolation_level=None)
con.execute('DROP TABLE IF EXISTS stocks')
for statement in in_memory.iterdump():
    con.execute(statement)
con.close()

Again, measure if you need this. If you have enough data that it matters, think hard about using a different data store, for example a full-blown DBMS like PostgreSQL.
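If the reason for batching is that the data no longer fits in memory (as in the question's update), a variation that keeps the in-memory database small would be to attach the on-disk file and move rows across in batches. This is just a sketch, not a tested recipe: it assumes the stocks table from the question, picks an arbitrary 50,000-row flush interval, and uses ATTACH DATABASE rather than iterdump():

import sqlite3

# Stage rows in an in-memory database and move them to disk in batches,
# so memory use stays bounded.
mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)')
mem.execute("ATTACH DATABASE 'example.db' AS disk")
mem.execute('CREATE TABLE IF NOT EXISTS disk.stocks (date text, trans text, symbol text, qty real, price real)')

def flush():
    # Copy everything staged in memory to the on-disk table, then empty the stage.
    mem.execute('INSERT INTO disk.stocks SELECT * FROM stocks')
    mem.execute('DELETE FROM stocks')
    mem.commit()

for i in xrange(200000):
    mem.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
    if (i + 1) % 50000 == 0:
        flush()   # touch the disk only once per 50,000 rows

flush()
mem.close()

The DELETE after each copy is what keeps memory bounded, and the on-disk database is only committed once per batch.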


2 Comments

Ok, fair enough. I have not measured; it seemed obvious that caching would be the right thing to do until you pointed out otherwise. The code you wrote above seems like it would do what I ask, but could potentially be slower (i.e. iterating over the in-memory data set). Either way, based on the other comments it looks right to me.
It was more of an interesting thought experiment than something I'd do if I valued the data...

In your case, you are creating a DB connection in autocommit mode, which means that every time you execute an INSERT statement, the database starts a transaction, executes the statement and commits. So your commit is, in this case, meaningless. See sqlite3 in the Python docs.

But you are correct that inserting a large quantity of rows should ideally be done within a transaction. This signals to the connection that it should record all the incoming INSERT statements in the journal file, delaying writes to the database file until the transaction is committed. Even though your execution is limited by I/O operations, writing to the journal file is not a serious performance penalty.
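To make that concrete, a minimal sketch of doing the question's inserts inside one transaction might look like this (the repeated row is just the sample row from the question):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS stocks (date text, trans text, symbol text, qty real, price real)')

# 10,000 copies of the sample row from the question.
rows = [('2006-01-05', 'BUY', 'RHAT', 100, 35.14)] * 10000

# All of these inserts run inside a single implicit transaction;
# nothing reaches the main database file until commit().
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', rows)
conn.commit()
conn.close()

executemany is just a convenience here; a plain loop of execute() calls inside the same transaction behaves the same way with respect to committing.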

