I have two problems with loading data in Python. Both scripts work properly, but they take too much time to run, and sometimes the first one ends with "Killed".

  1. I have a big zipped text file and I do something like this:

    import gzip
    import cPickle as pickle
    
    f = gzip.open('filename.gz', 'r')
    tab = {}

    for line in f:
        pass  # fill tab from each line

    f.close()

    with open("data_dict.pkl", "wb") as g:
        pickle.dump(tab, g)
    
  2. I have to do some operations on the dictionary I created in the previous script

    import cPickle as pickle
    
    with open("data_dict.pkl", "rb") as f:
            tab = pickle.load(f)
    f.close()
    
    #operations on tab (the dictionary)
    

Do you have other solutions in mind? Maybe not ones involving YAML or JSON...

  • Pickle is slow and can be pretty insecure. But you should at least use the fastest pickle protocol (see the docs): pass pickle.HIGHEST_PROTOCOL as the third parameter to your dump (see the sketch after these comments). Depending on what you really do, there are lots of other options to speed things up (e.g. an sqlite db). Commented Nov 20, 2013 at 22:09
  • Is the issue that you're loading everything into memory, rather than streaming? If so, you might want to check out streaming pickle (code.google.com/p/streaming-pickle). Commented Nov 20, 2013 at 22:11
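
To illustrate both comments, here is a minimal sketch (using Python 2's cPickle to match the question; the sample tab and the data_stream.pkl filename are just for illustration). It shows the faster binary protocol, plus a streaming variant where entries are dumped one at a time, so neither the writer nor the reader has to hold more than one entry in flight at once:

    import cPickle as pickle

    tab = {"example_key": [1, 2, 3]}  # stand-in for the real dictionary

    # Faster: dump the whole dict with the highest (binary) protocol
    # instead of the default ASCII protocol 0.
    with open("data_dict.pkl", "wb") as g:
        pickle.dump(tab, g, pickle.HIGHEST_PROTOCOL)

    # Streaming: successive pickle.dump calls to one file can be read
    # back with successive pickle.load calls until EOFError is raised.
    with open("data_stream.pkl", "wb") as g:
        for key, value in tab.iteritems():
            pickle.dump((key, value), g, pickle.HIGHEST_PROTOCOL)

    tab = {}
    with open("data_stream.pkl", "rb") as f:
        while True:
            try:
                key, value = pickle.load(f)
            except EOFError:
                break
            tab[key] = value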

2 Answers


If the data you are pickling is primitive and simple, you can try the marshal module: http://docs.python.org/3/library/marshal.html#module-marshal. That's what Python uses to serialize its bytecode, so it's pretty fast.
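
A minimal sketch of how that could look (assuming, as in the question, that tab contains only built-in types; the sample tab and the data_dict.marshal filename are just for illustration). Keep in mind that marshal's on-disk format is not guaranteed to be stable across Python versions, so it suits short-lived caches rather than long-term storage:

    import marshal

    tab = {"example_key": [1, 2, 3]}  # stand-in for the real dictionary

    # marshal.dump/load mirror pickle's API but only handle
    # core built-in types (dicts, lists, strings, numbers, ...)
    with open("data_dict.marshal", "wb") as g:
        marshal.dump(tab, g)

    with open("data_dict.marshal", "rb") as f:
        tab = marshal.load(f)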



First, a quick comment. In:

with open("data_dict.pkl", "rb") as f:
        tab = pickle.load(f)
f.close()

the f.close() is not necessary; the context manager (the with statement) closes the file automatically.

Now, as for speed: I don't think you're going to get much faster than cPickle for reading something from disk directly into a Python object. If this script needs to be run over and over, I would try using memcached via pylibmc to keep the object stored persistently in memory, so you can access it lightning fast:

import pylibmc

mc = pylibmc.Client(["127.0.0.1"], binary=True,
                    behaviors={"tcp_nodelay": True, "ketama": True})
d = range(10000)          ## some big object
mc["some_key"] = d        ## save in memory

Then, after saving it once, you can access and modify it from later runs; the object stays in memory even after the previous program finishes its execution:

import pylibmc

mc = pylibmc.Client(["127.0.0.1"], binary=True,
                    behaviors={"tcp_nodelay": True, "ketama": True})
d = mc["some_key"]        ## load from memory
d[0] = 'some other value' ## modify
mc["some_key"] = d        ## save to memory again

