Suppose I have a very large file and I simply want to divide it into smaller chunks and process them successively. However, reading and writing those chunks is the bottleneck in my implementation, so I am looking for the fastest possible approach. At present I am using cPickle to dump and load the chunks. Do you have any alternative suggestions?
-
Consider loading the file into a faster store (mmap, as Ignacio suggested) or a high-speed cache (like memcache or Redis). That way the splitting and chunking part is sped up. You can't run away from I/O if you will be writing the stuff to disk. – Burhan Khalid, Nov 6, 2013 at 8:15
-
How large is your file (in GB or TB, I assume?), and what file format? – usethedeathstar, Nov 6, 2013 at 9:55
-
@usethedeathstar The file format is not a problem; if one format is faster than another, I can convert it. The problem is to find the best read method along with its most convenient file format. – erogol, Nov 6, 2013 at 12:09
-
Can you convert the data to a simple C structure? You can use a CFFI structure to dump memory to a file, or even use mmap. CFFI is many times faster on PyPy. In any case, hard drives are so slow that you even have time to compress/decompress the data (e.g. with LZO compression). – Arpegius, Nov 6, 2013 at 15:04
-
@Erogol Yeah, which is why it is interesting to know what file format you use right now, and how big the file is in your current format. – usethedeathstar, Nov 6, 2013 at 15:43
2 Answers
mmap will map part of the file cache into process memory, allowing pointer-based (or in Python's case, index-/slice-based) access to the bytes in the file. From there you can slice the mmap object to get strings, and pass them to cPickle.loads() in order to restore the original objects.
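A minimal sketch of that idea, assuming a single pickled object in a file named `data.pkl` (the file name and payload are illustrative, and `pickle` stands in for Python 2's `cPickle`):

```python
import mmap
import pickle  # cPickle on Python 2

# Illustrative setup: write one pickled object to 'data.pkl'.
payload = {'rows': list(range(5))}
with open('data.pkl', 'wb') as f:
    f.write(pickle.dumps(payload))

with open('data.pkl', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the mmap object yields bytes backed by the OS file cache;
    # pickle.loads() then reconstructs the original object.
    restored = pickle.loads(mm[0:mm.size()])
    mm.close()
```

In a real chunked file you would keep the start/end byte offsets of each chunk and slice `mm[start:end]` per chunk instead of mapping the whole file into one object.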
You probably won't get anything faster than file.read(chunksize) to read chunksize bytes from a file. You can just keep doing that until a read returns fewer than chunksize bytes (then you know you've hit the end of the file). e.g.:
with open('datafile', 'rb') as fin:
    data = fin.read(chunksize)
    process(data)
    # a short read signals the end of the file
    while len(data) == chunksize:
        data = fin.read(chunksize)
        if data:
            process(data)
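Since the question mentions cPickle, the chunked-read idea can also lean on pickle's own framing: each dump call writes one self-delimiting record, so reading back needs no manual size bookkeeping. A sketch under assumptions (the file name `chunks.pkl` and both helper names are hypothetical; `pickle` stands in for Python 2's `cPickle`):

```python
import pickle

def write_chunks(chunks, path='chunks.pkl'):
    """Append each chunk to the file as its own pickle record."""
    with open(path, 'wb') as fout:
        for chunk in chunks:
            pickle.dump(chunk, fout, protocol=pickle.HIGHEST_PROTOCOL)

def read_chunks(path='chunks.pkl'):
    """Yield chunks one at a time; pickle.load reads one record per call."""
    with open(path, 'rb') as fin:
        while True:
            try:
                yield pickle.load(fin)
            except EOFError:  # raised when no records remain
                break

write_chunks([[1, 2], [3, 4], [5]])
```

Because `read_chunks` is a generator, only one chunk is in memory at a time, which matches the "process successively" goal.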
However, since you say you're using cPickle, I'm not really sure what the data looks like, or whether you're looking for something more sophisticated...
And a word of warning: generally speaking, file I/O is one of the slowest things you can do with your computer. If you're doing a lot of it, you can expect it to be a bottleneck no matter what you do (unless you have a really fancy file system -- then you might be able to do something about it).