
I have an SQLite table with a few hundred million rows:

sqlite> create table t1(id INTEGER PRIMARY KEY, stuff TEXT);

I need to query this table by its integer primary key hundreds of millions of times. My code:

conn = sqlite3.connect('stuff.db')
with conn:
    cur = conn.cursor()
    for id in ids:
        try:
            cur.execute("select stuff from t1 where rowid=?",[id])
            stuff_tuple = cur.fetchone()
            #do something with the fetched row
        except:
            pass #for when id is not in t1's key set

Here, ids is a list that may have tens of thousands of elements. Building t1 did not take very long (i.e. ~75K inserts per second), but querying it the way I've done it is unacceptably slow (i.e. ~1K queries in 10 seconds).

I am completely new to SQL. What am I doing wrong?

  • "I have an SQLite table with a few hundred million rows." Unless you absolutely need to stick to SQLite, you should drop it and use a real database. SQLite is not meant to handle that amount of data efficiently. Commented Oct 25, 2012 at 1:14
  • Interesting, any suggestions? I was originally just using a dict, but it turns out that I will have too much data to fit in RAM. I figured SQLite was the way to go. Commented Oct 25, 2012 at 1:16
  • I don't want to start the usual dispute, but any of MySQL, PostgreSQL, MSSQL, or Oracle should do just fine. What's important is that they allow you to fine-tune their performance characteristics and also split the load across multiple machines. Simply put, you have an enterprise-grade amount of data, so you should use an enterprise-grade database engine. If you're on Linux, I'd recommend PostgreSQL; I've used it for handling large datasets and it worked fine. There's also a good book about fine-tuning it - amazon.com/PostgreSQL-High-Performance-Gregory-Smith/dp/… (NO affiliation) Commented Oct 25, 2012 at 1:20
  • If you were using a dict, then it seems that you don't need a relational database. Perhaps a simple key-value store will do? You may want to look into Redis or CouchDB. Commented Oct 25, 2012 at 1:21
  • Redis, MongoDB, or any other NoSQL database is easy to set up and maintain. If you use a relational database, you will have to create a schema, but you could write stored procedures instead of writing queries in Python code. Commented Oct 25, 2012 at 1:24

2 Answers

Answer 1 (score 1)

Since you're retrieving values by their keys, it seems like a key/value store would be more appropriate in this case. Relational databases (SQLite included) are definitely feature-rich, but you can't beat the performance of a simple key/value store (a minimal sketch follows the list below).

There are several to choose from:

  • Redis: "advanced key-value store", very fast, optimized for in-memory operation
  • Cassandra: extremely high performance, scalable, used by multiple high-profile sites
  • MongoDB: feature-rich, tries to be "middle ground" between relational and NoSQL (and they've started offering free online classes)

And there are many, many more.
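
For a sense of what the key/value approach looks like from Python, here is a minimal sketch using the redis-py client. It assumes a Redis server on localhost and that the id/stuff pairs have already been loaded into it (e.g. with r.mset); MGET fetches a whole batch of keys in one round trip, and missing ids simply come back as None:

import redis

r = redis.Redis(host='localhost', port=6379)

# One round trip for the entire batch; ids not present return None.
values = r.mget([str(i) for i in ids])
for id_, stuff in zip(ids, values):
    if stuff is not None:
        pass  # do something with the fetched value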


Answer 2 (score -1)

You should make one SQL call instead; it should be much faster:

conn = sqlite3.connect('stuff.db')
with conn:
    cur = conn.cursor()

    # Build one IN (...) query with a '?' placeholder for every id, binding the whole list at once.
    for row in cur.execute("SELECT stuff FROM t1 WHERE rowid IN (%s)" % ','.join('?'*len(ids)), ids):
        #do something with the fetched row
        pass 
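
One caveat with a single IN (...) query: SQLite limits the number of host parameters per statement (999 by default before SQLite 3.32.0, 32766 since), and the question says ids may have tens of thousands of elements. A minimal sketch of batching the lookups, reusing cur from above; the chunk size of 900 is just a safe value under the old default limit:

CHUNK = 900  # stay under SQLite's default host-parameter limit

for start in range(0, len(ids), CHUNK):
    chunk = ids[start:start + CHUNK]
    placeholders = ','.join('?' * len(chunk))
    for row in cur.execute("SELECT stuff FROM t1 WHERE rowid IN (%s)" % placeholders, chunk):
        pass  # do something with the fetched row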

You do not need the try/except, since ids that are not in the database simply will not appear in the results. If you want to know which ids were not found, select the rowid as well and collect what comes back:

ids_res = set()
for row in cur.execute("SELECT rowid, stuff FROM t1 WHERE rowid IN (%s)" % ','.join('?'*len(ids)), ids):
    ids_res.add(row[0])  # first column is the rowid
ids_not_found = set(ids) - ids_res
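
An alternative for very large id lists (not from the original answer, but a standard SQLite pattern) is to load the ids into a temporary table and join against it, which avoids the host-parameter limit entirely. A minimal sketch, assuming the same conn/cur as above:

cur.execute("CREATE TEMP TABLE wanted(id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO wanted(id) VALUES (?)", ((i,) for i in ids))

for row in cur.execute("SELECT t1.rowid, t1.stuff FROM t1 JOIN wanted ON t1.rowid = wanted.id"):
    pass  # do something with each found row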

