Suppose I have a very large file and I simply want to divide it into smaller chunks and process them successively. However, reading and writing those chunks is the bottleneck in my implementation, so I am looking for the fastest possible approach. At present I am using cPickle to dump and load the chunks. Do you have any alternative suggestions?
-
Consider loading the file into a faster store (mmap, as Ignacio suggested) or a high-speed cache (like memcache or Redis). That way the splitting and chunking part is sped up. You can't run away from I/O if you will be writing the stuff to disk. – Burhan Khalid, Nov 6, 2013 at 8:15
-
How large is your file (in GB or TB, I assume?), and what file format? – usethedeathstar, Nov 6, 2013 at 9:55
-
@usethedeathstar The file format is not a problem; if one format is faster than another, I can convert it. The problem is to find the best read method along with its most convenient file format. – erogol, Nov 6, 2013 at 12:09
-
Can you convert the data to a simple C structure? You can use a CFFI structure to dump memory to a file, or even use mmap. CFFI is many times faster on PyPy. In any case, hard drives are so slow that you even have time to compress/decompress the data (e.g. with LZO compression). – Arpegius, Nov 6, 2013 at 15:04
-
@Erogol Yeah, which is why it is interesting to know what file format you use right now, and how big the file is in your current format. – usethedeathstar, Nov 6, 2013 at 15:43
2 Answers
mmap will map part of the file cache into process memory, allowing pointer-based (or in Python's case, index-/slice-based) access to the bytes in the file. From there you can slice the mmap object to get strings, and pass them to cPickle.loads() in order to restore the original objects.
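A minimal sketch of that idea, assuming a single pickled object in a file named `data.pkl` (the file name and payload are illustrative, and `pickle` stands in for Python 2's `cPickle`):

```python
import mmap
import pickle  # cPickle on Python 2

# Illustrative setup: write one pickled object to 'data.pkl'.
payload = {'rows': list(range(5))}
with open('data.pkl', 'wb') as f:
    f.write(pickle.dumps(payload))

with open('data.pkl', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the mmap object yields bytes backed by the OS file cache;
    # pickle.loads() then reconstructs the original object.
    restored = pickle.loads(mm[0:mm.size()])
    mm.close()
```

In a real chunked file you would keep the start/end byte offsets of each chunk and slice `mm[start:end]` per chunk instead of mapping the whole file into one object.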
You probably won't get anything faster than file.read(chunksize) to read chunksize bytes from a file. You can just keep doing that until a read returns fewer than chunksize bytes (then you know you've hit the end of the file). e.g.:
with open('datafile', 'rb') as fin:
    data = fin.read(chunksize)
    process(data)
    # a short read signals the end of the file
    while len(data) == chunksize:
        data = fin.read(chunksize)
        if data:
            process(data)
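Since the question mentions cPickle, the chunked-read idea can also lean on pickle's own framing: each dump call writes one self-delimiting record, so reading back needs no manual size bookkeeping. A sketch under assumptions (the file name `chunks.pkl` and both helper names are hypothetical; `pickle` stands in for Python 2's `cPickle`):

```python
import pickle

def write_chunks(chunks, path='chunks.pkl'):
    """Append each chunk to the file as its own pickle record."""
    with open(path, 'wb') as fout:
        for chunk in chunks:
            pickle.dump(chunk, fout, protocol=pickle.HIGHEST_PROTOCOL)

def read_chunks(path='chunks.pkl'):
    """Yield chunks one at a time; pickle.load reads one record per call."""
    with open(path, 'rb') as fin:
        while True:
            try:
                yield pickle.load(fin)
            except EOFError:  # raised when no records remain
                break

write_chunks([[1, 2], [3, 4], [5]])
```

Because `read_chunks` is a generator, only one chunk is in memory at a time, which matches the "process successively" goal.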
However, since you say you're using cPickle, I'm not really sure what the data looks like, or whether you're looking for something more sophisticated...
And a word of warning: generally speaking, file I/O is one of the slowest things you can do with your computer. If you're doing a lot of it, you can expect it to be a bottleneck no matter what you do (unless you have a really fancy file system -- then you might be able to do something about it).