I'm seeking advice about methods of implementing object persistence in Python. To be more precise, I wish to be able to link a Python object to a file in such a way that any Python process that opens a representation of that file shares the same information, any process can change its object and the changes will propagate to the other processes, and even if all processes "storing" the object are closed, the file will remain and can be re-opened by another process.
I found three main candidates for this in my distribution of Python - anydbm, pickle, and shelve (dbm appeared to be perfect, but it is Unix-only, and I am on Windows). However, they all have flaws:
- anydbm can only handle a dictionary of string values (I'm seeking to store a list of dictionaries, all of which have string keys and string values, though ideally I would seek a module with no type restrictions)
- shelve requires that a file be re-opened before changes propagate - for instance, if two processes A and B load the same file (containing a shelved empty list), and A adds an item to the list and calls sync(), B will still see the list as being empty until it reloads the file.
- pickle (the module I am currently using for my test implementation) has the same "reload requirement" as shelve, and also does not overwrite previous data - if process A dumps fifteen empty strings onto a file, and then the string 'hello', process B will have to load the file sixteen times in order to get the 'hello' string. I am currently dealing with this problem by preceding any write operation with repeated reads until end of file ("wiping the slate clean before writing on it"), and by making every read operation repeated until end of file, but I feel there must be a better way.
My ideal module would behave as follows (with "A>>>" representing code executed by process A, and "B>>>" code executed by process B):
A>>> import imaginary_perfect_module as mod
B>>> import imaginary_perfect_module as mod
A>>> d = mod.load('a_file')
B>>> d = mod.load('a_file')
A>>> d
{}
B>>> d
{}
A>>> d[1] = 'this string is one'
A>>> d['ones'] = 1 #anydbm would sulk here
A>>> d['ones'] = 11
A>>> d['a dict'] = {'this dictionary' : 'is arbitrary', 42 : 'the answer'}
B>>> d['ones'] #shelve would raise a KeyError here, unless A had called d.sync() and B had reloaded d
11 #pickle (with different syntax) would have returned 1 here, and then 11 on next call
(etc. for B)
I could achieve this behaviour by creating my own module that uses pickle, and editing the dump and load behaviour so that they use the repeated reads I mentioned above - but I find it hard to believe that this problem has never occurred to, and been fixed by, more talented programmers before. Moreover, these repeated reads seem inefficient to me (though I must admit that my knowledge of operation complexity is limited, and it's possible that these repeated reads are going on "behind the scenes" in otherwise apparently smoother modules like shelve). Therefore, I conclude that I must be missing some code module that would solve the problem for me. I'd be grateful if anyone could point me in the right direction, or give advice about implementation.
mongo-db. It's not as completely integrated into the language as your example above, but it will give you a much more robust and tolerant database than pickling to the filesystem and being smart about locks.