I have two problems with loading data in Python. Both scripts work properly, but they take too much time to run, and sometimes the first one ends with "Killed".

  1. I have a big zipped text file and I do something like this:

    import gzip
    import cPickle as pickle
    
    f = gzip.open('filename.gz', 'r')
    tab = {}

    for line in f:
        pass  # fill tab from each line

    f.close()

    with open("data_dict.pkl", "wb") as g:
        pickle.dump(tab, g)
    
  2. I have to do some operations on the dictionary I created in the previous script

    import cPickle as pickle
    
    with open("data_dict.pkl", "rb") as f:
            tab = pickle.load(f)
    f.close()
    
    #operations on tab (the dictionary)
    

Do you have other solutions in mind? Maybe not ones involving YAML or JSON...

  • Pickle is slow and can be pretty insecure. But you should at least use the fastest pickle protocol (see the docs): pass pickle.HIGHEST_PROTOCOL as the third parameter to your dump (see the sketch after these comments). Depending on what you really do, there are lots of other options to speed things up (e.g. an sqlite db). Commented Nov 20, 2013 at 22:09
  • Is the issue that you're loading everything into memory, rather than streaming? If so, you might want to check out streaming pickle (code.google.com/p/streaming-pickle). Commented Nov 20, 2013 at 22:11
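
To illustrate both comments, here is a minimal sketch (using Python 2's cPickle to match the question; the sample tab and the data_stream.pkl filename are just for illustration). It shows the faster binary protocol, plus a streaming variant where entries are dumped one at a time, so neither the writer nor the reader has to hold more than one entry in flight at once:

    import cPickle as pickle

    tab = {"example_key": [1, 2, 3]}  # stand-in for the real dictionary

    # Faster: dump the whole dict with the highest (binary) protocol
    # instead of the default ASCII protocol 0.
    with open("data_dict.pkl", "wb") as g:
        pickle.dump(tab, g, pickle.HIGHEST_PROTOCOL)

    # Streaming: successive pickle.dump calls to one file can be read
    # back with successive pickle.load calls until EOFError is raised.
    with open("data_stream.pkl", "wb") as g:
        for key, value in tab.iteritems():
            pickle.dump((key, value), g, pickle.HIGHEST_PROTOCOL)

    tab = {}
    with open("data_stream.pkl", "rb") as f:
        while True:
            try:
                key, value = pickle.load(f)
            except EOFError:
                break
            tab[key] = value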

2 Answers


If the data you are pickling is primitive and simple, you can try the marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.
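
A minimal sketch of how that could look (assuming, as in the question, that tab contains only built-in types; the sample tab and the data_dict.marshal filename are just for illustration). Keep in mind that marshal's on-disk format is not guaranteed to be stable across Python versions, so it suits short-lived caches rather than long-term storage:

    import marshal

    tab = {"example_key": [1, 2, 3]}  # stand-in for the real dictionary

    # marshal.dump/load mirror pickle's API but only handle
    # core built-in types (dicts, lists, strings, numbers, ...)
    with open("data_dict.marshal", "wb") as g:
        marshal.dump(tab, g)

    with open("data_dict.marshal", "rb") as f:
        tab = marshal.load(f)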



First, a quick comment. In:

with open("data_dict.pkl", "rb") as f:
        tab = pickle.load(f)
f.close()

the f.close() is not necessary; the context manager (the with statement) closes the file automatically.

Now, as for speed: I don't think you're going to get much faster than cPickle for reading something from disk directly into a Python object. If this script needs to be run over and over, I would try using memcached via pylibmc to keep the object stored persistently in memory, so you can access it lightning fast:

import pylibmc

mc = pylibmc.Client(["127.0.0.1"], binary=True,
                    behaviors={"tcp_nodelay": True, "ketama": True})
d = range(10000)          ## some big object
mc["some_key"] = d        ## save in memory

Then, after saving it once, you can access and modify it from later runs; the object stays in memory even after the previous program finishes its execution:

import pylibmc

mc = pylibmc.Client(["127.0.0.1"], binary=True,
                    behaviors={"tcp_nodelay": True, "ketama": True})
d = mc["some_key"]        ## load from memory
d[0] = 'some other value' ## modify
mc["some_key"] = d        ## save to memory again

