How to implement database-style table in Python

Question

I am implementing a class that resembles a typical database table:

has named columns and unnamed rows
has a primary key by which I can refer to the rows
supports retrieval and assignment by primary key and column title
can be asked to add a unique or non-unique index for any of the columns, allowing fast retrieval of a row (or set of rows) that have a given value in that column
removal of a row is fast and is implemented as "soft-delete": the row is kept physically, but is marked for deletion and won't show up in any subsequent retrieval operations
addition of a column is fast
rows are rarely added
columns are rarely deleted

I decided to implement the class directly rather than use a wrapper around sqlite.

What would be a good data structure to use?

Just as an example, one approach I was thinking about is a dictionary. Its keys are the values in the primary key column of the table; its values are the rows implemented in one of these ways:

As lists. Column numbers are mapped into column titles (using a list for one direction and a map for the other). Here, a retrieval operation would first convert column title into column number, and then find the corresponding element in the list.
As dictionaries. Column titles are the keys of this dictionary.

Not sure about the pros/cons of the two.
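
For illustration, the two row layouts could be sketched roughly as follows. This is not a full table class, only the retrieval and add-column paths; the column titles, sample keys, and helper names (`titles`, `col_index`, `get_from_list`, and so on) are made up for the example.

```python
# Layout 1: each row is a list; a list and a dict map between column
# numbers and column titles.
titles = ["name", "age"]                          # column number -> title
col_index = {t: i for i, t in enumerate(titles)}  # title -> column number

rows_as_lists = {
    101: ["alice", 30],
    102: ["bob", 25],
}

def get_from_list(pk, title):
    # Two steps: title -> column number, then index into the row list.
    return rows_as_lists[pk][col_index[title]]

# Layout 2: each row is a dict keyed by column title.
rows_as_dicts = {
    101: {"name": "alice", "age": 30},
    102: {"name": "bob", "age": 25},
}

def get_from_dict(pk, title):
    return rows_as_dicts[pk][title]

# Adding a column: layout 1 updates both mappings and appends to every
# row list; layout 2 adds the new key to every row dict.
def add_column_lists(title, default=None):
    col_index[title] = len(titles)
    titles.append(title)
    for row in rows_as_lists.values():
        row.append(default)

def add_column_dicts(title, default=None):
    for row in rows_as_dicts.values():
        row[title] = default
```

Both layouts give average O(1) access by primary key and column title. The list layout pays one extra dictionary lookup per access, while the dict layout carries a separate hash table for every row.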

The reasons I want to write my own code are:

I need to track row deletions. That is, at any time I want to be able to report which rows were deleted and for what "reason" (the "reason" is passed to my delete method).
I need some reporting during indexing (e.g., while a non-unique index is being built, I want to check certain conditions and report if they are violated).

@delnan @katrielalex: I just edited my question to give a couple of reasons. Maybe there's a way to do that with a sqlite wrapper?
– max, Commented Nov 15, 2010 at 20:05

unutbu · Accepted Answer · 2010-11-15 21:15:16Z

2

You might want to consider creating a class which uses an in-memory sqlite table under the hood:

import sqlite3

class MyTable(object):
    def __init__(self):
        self.conn = sqlite3.connect(':memory:')
        self.cursor = self.conn.cursor()
        # Schema left elided; delete() below assumes the table has
        # tableid, softdelete and reason columns.
        sql = '''\
            CREATE TABLE foo ...
        '''
        self.execute(sql)
    def execute(self, sql, args=()):
        # args defaults to an empty tuple so parameterless SQL also works
        self.cursor.execute(sql, args)
    def delete(self, id, reason):
        # sqlite3 uses ? placeholders (not %s); "table" is a reserved word
        sql = 'UPDATE foo SET softdelete = 1, reason = ? WHERE tableid = ?'
        self.execute(sql, (reason, id))
    def verify(self):
        # Check that certain conditions are true
        # Report (or raise exception?) if violated
        pass
    def build_index(self):
        self.verify()
        ...

Soft-delete can be implemented by having a softdelete column holding 0 or 1 (SQLite has no separate boolean type, so an integer column serves). Similarly, you can have a column to store the reason for deletion. Undeleting would simply involve updating the row and resetting the softdelete value. Selecting rows that have not been deleted could be achieved with the SQL condition WHERE softdelete != 1, provided the column defaults to 0 rather than NULL.
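
As a rough, self-contained sketch of that idea: the table name `foo` and the `softdelete` and `reason` columns follow the answer, while the `name` column, the sample rows, and the helper function names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only softdelete and reason come from the answer.
conn.execute(
    "CREATE TABLE foo ("
    " tableid INTEGER PRIMARY KEY,"
    " name TEXT,"
    " softdelete INTEGER NOT NULL DEFAULT 0,"
    " reason TEXT)"
)
conn.executemany(
    "INSERT INTO foo (tableid, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)

def soft_delete(row_id, reason):
    # Mark the row instead of removing it, and remember why.
    conn.execute(
        "UPDATE foo SET softdelete = 1, reason = ? WHERE tableid = ?",
        (reason, row_id),
    )

def undelete(row_id):
    conn.execute(
        "UPDATE foo SET softdelete = 0, reason = NULL WHERE tableid = ?",
        (row_id,),
    )

def live_rows():
    # Rows that have not been soft-deleted.
    return conn.execute(
        "SELECT tableid, name FROM foo WHERE softdelete != 1 ORDER BY tableid"
    ).fetchall()

def deleted_rows():
    # Report of deleted rows together with the recorded reason.
    return conn.execute(
        "SELECT tableid, reason FROM foo WHERE softdelete = 1 ORDER BY tableid"
    ).fetchall()

soft_delete(2, "duplicate entry")
```

After the last line, `live_rows()` no longer includes row 2 and `deleted_rows()` reports it with its reason.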

You could write a verify method to verify conditions on your data are satisfied. And you could call that method from within your build_index method.
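
If the conditions have to be checked row by row while the index is being built, one possibility is to pass the check into the index builder instead of running it beforehand. The following is a minimal plain-Python sketch under that assumption; `build_index`, its `check` parameter, the row layout, and the sample data are all hypothetical and not part of sqlite3.

```python
from collections import defaultdict

def build_index(rows, column, check=None, unique=False):
    """Build {value: set of primary keys} for one column.

    rows  -- dict mapping primary key -> row dict (hypothetical layout)
    check -- optional callable(pk, row) returning an error string or None
    Returns (index, problems); problems lists every violation found.
    """
    index = defaultdict(set)
    problems = []
    for pk, row in rows.items():
        if check is not None:
            message = check(pk, row)
            if message:
                problems.append((pk, message))
        value = row[column]
        # For a unique index, a value that is already present is a violation.
        if unique and index[value]:
            problems.append((pk, "duplicate value %r" % (value,)))
        index[value].add(pk)
    return dict(index), problems

rows = {
    1: {"city": "Oslo", "age": 30},
    2: {"city": "Rome", "age": -5},
    3: {"city": "Oslo", "age": 41},
}

def age_is_valid(pk, row):
    return None if row["age"] >= 0 else "negative age"

index, problems = build_index(rows, "city", check=age_is_valid)
```

Collecting violations in a list rather than raising keeps the build going, so a single pass can report every problem.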

Another alternative is to use a numpy structured masked array.

It's hard to say what would be fastest. Perhaps the only sure way to tell would be to write code for each and benchmark on real-world data with timeit.
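
A benchmark along those lines might look like the sketch below. It uses small made-up data and three hypothetical layouts (list rows, dict rows, and a single dict keyed by tuples); real measurements should use your own data and access patterns.

```python
import timeit

N_ROWS = 1000
TITLES = ["c%d" % i for i in range(10)]
col_index = {t: i for i, t in enumerate(TITLES)}

# Three candidate layouts holding the same made-up data.
as_lists = {pk: [pk * 10 + i for i in range(len(TITLES))]
            for pk in range(N_ROWS)}
as_dicts = {pk: dict(zip(TITLES, row)) for pk, row in as_lists.items()}
as_tuple_keys = {(t, pk): row[col_index[t]]
                 for pk, row in as_lists.items() for t in TITLES}

def from_lists(pk=500, title="c7"):
    return as_lists[pk][col_index[title]]

def from_dicts(pk=500, title="c7"):
    return as_dicts[pk][title]

def from_tuple_keys(pk=500, title="c7"):
    return as_tuple_keys[(title, pk)]

def benchmark(number=100000):
    # Seconds per layout; results vary by machine and Python version.
    return {f.__name__: timeit.timeit(f, number=number)
            for f in (from_lists, from_dicts, from_tuple_keys)}

if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print("%-16s %.4f s" % (name, seconds))
```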

answered Nov 15, 2010 at 21:15

unutbu

1 Comment

max Over a year ago

I like the softdelete column idea. I don't think I can do the verify method since my conditions were being checked while the index is being built (row by row); but it may be a small price to pay if I can rely on sqlite instead of a custom class. And I was interested to learn about MaskedArray too, never heard of it before.

Morlock · Accepted Answer · 2010-11-15 19:56:59Z

2

I would consider building a dictionary whose keys are tuples (lists are unhashable, so they cannot be dictionary keys). E.g., my_dict[("col_2", "row_24")] would get you this element. Starting from there, it would be pretty easy (if not extremely fast for very large databases) to write 'get_col' and 'get_row' methods, and then 'get_row_slice' and 'get_col_slice' on top of those two, to access your data.

Using a single dictionary like that will have 2 advantages: 1) getting a single element will be faster than with your 2 proposed methods; 2) if you want to have a different number of elements (or missing elements) in your columns, this will make it extremely easy and memory efficient.
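
A minimal sketch of this layout, with made-up column and row names; `get_cell`, `get_row`, and `get_col` are hypothetical helpers written for the example.

```python
# Every cell is stored under a (column, row) tuple key.
cells = {
    ("name", "row_1"): "alice",
    ("age", "row_1"): 30,
    ("name", "row_2"): "bob",
    # ("age", "row_2") is deliberately absent: missing cells take no space.
}

def get_cell(col, row, default=None):
    # A single hash lookup on the tuple key.
    return cells.get((col, row), default)

def get_row(row):
    # Scans every key, so the cost grows with the total number of cells.
    return {c: v for (c, r), v in cells.items() if r == row}

def get_col(col):
    return {r: v for (c, r), v in cells.items() if c == col}
```

Single-cell access is one lookup, but `get_row` and `get_col` as written walk the whole dictionary, which is the "not extremely fast for very large databases" caveat mentioned above.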

Just a thought :) I'll be curious to see what packages people will suggest!

Cheers

answered Nov 15, 2010 at 19:56

Morlock

3 Comments

max Over a year ago

Interesting. It seems faster to me too, but I'm not sure: conceivably getting a list element might be as fast as the extra work needed to calculate hash for a tuple rather than a single value. Btw, I am only concerned with time, not memory, efficiency.

Morlock Over a year ago

Then a dictionary will DEFINITELY be faster when the table gets bigger. Searching an unsorted list of one million elements for a value means examining half a million elements on average before you get to the right one. With a hash table, hence a dictionary, a lookup takes roughly constant time on average, however large the table is. Cheers!

max Over a year ago

I agree, but when I said "list" I meant using a dictionary to convert the column name to a list index for O(1) access. It's easier to just use a dictionary to start with.

Roger Binns · Accepted Answer · 2010-11-20 05:44:09Z

0

You really should use SQLite.

For your first reason (tracking deletion reasons) you can easily implement this by having a second table that you "move" rows to on deletion. The reason can be tracked in an additional column in that table or in another table you can join. If a deletion reason isn't always required then you can even use triggers on your source table to copy rows about to be deleted, and/or have a user-defined function that can get the reason.
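
The "move on deletion" variant could be sketched like this, using explicit statements rather than a trigger. The table names `items` and `deleted_items`, their columns, and the function `delete_with_reason` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    -- Same columns as items, plus the deletion reason.
    CREATE TABLE deleted_items (id INTEGER PRIMARY KEY, name TEXT, reason TEXT);
    INSERT INTO items (id, name) VALUES (1, 'alice');
    INSERT INTO items (id, name) VALUES (2, 'bob');
""")

def delete_with_reason(row_id, reason):
    # "with conn" commits both statements together, or rolls back on error,
    # so a row is never copied without being removed (or vice versa).
    with conn:
        conn.execute(
            "INSERT INTO deleted_items (id, name, reason) "
            "SELECT id, name, ? FROM items WHERE id = ?",
            (reason, row_id),
        )
        conn.execute("DELETE FROM items WHERE id = ?", (row_id,))

delete_with_reason(2, "obsolete")
```

Queries against `items` then need no extra filter, and the deletion report is a plain SELECT on `deleted_items`.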

The indexing reason is somewhat covered by constraints etc but I can't directly address it without more details.

answered Nov 20, 2010 at 5:44

Roger Binns
