
How can one convert a huge text file (>16G) containing binary-valued characters (0 and 1) to a NumPy array file without blowing up the memory in Python? Assume we have enough storage on the machine, but not enough RAM for the conversion.

Sample data:

0,0,0,0,0,1,0,0,0 
1,0,0,1,0,0,0,0,0
...

Sample code:

with open("data.txt") as f:
    converted_data = [list(map(int, line.split(','))) for line in f]
  • What do you mean by "binary-valued strings"? What's the exact format of the existing file? You might be able to do something with a memory-mapped array, depending on the existing format. Commented Jun 17, 2015 at 22:42
  • This is a good topic but there's not enough information given to provide a useful answer. Can you post the code that would work to read a smaller file? If you can't do that, please provide more information about the format of the file, etc. Commented Jun 17, 2015 at 23:24
  • The test file contains "0" and "1" characters separated by commas. Commented Jun 18, 2015 at 15:45
  • Here is the sample code I used to do the conversion: Commented Jun 18, 2015 at 15:46
  • Are you interested in the memory consumption of the import process (which can be reduced by reading and converting the data serially instead of reading it all at once), or in the footprint of converted_data (which will be > 1G unless you can make use of sparse structures)? Commented Jun 18, 2015 at 20:57

2 Answers


You can create many binary files with pickle, along with some code that loads and unloads the different parts of your data as needed.

Say you have a file that is 16 GB: you can split it into 16 pickle files of 1 GB each.

Since you say you have enough storage, once the pickle files are written you can load them back one part at a time instead of holding everything in memory at once.
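A minimal sketch of that idea (the file names and chunk size are placeholders, not part of the question): stream the text file line by line, accumulate a fixed number of parsed rows, and pickle each chunk to its own file, so at most one chunk is ever in memory.

```python
import pickle

def split_into_pickles(src_path, chunk_lines=1_000_000):
    """Stream src_path and pickle every chunk_lines parsed rows to a file of its own."""
    chunk, part = [], 0
    with open(src_path) as f:
        for line in f:
            chunk.append([int(tok) for tok in line.split(',')])
            if len(chunk) == chunk_lines:
                with open(f"{src_path}.part{part}.pkl", "wb") as out:
                    pickle.dump(chunk, out)
                chunk = []
                part += 1
    if chunk:  # flush the final, possibly partial, chunk
        with open(f"{src_path}.part{part}.pkl", "wb") as out:
            pickle.dump(chunk, out)
        part += 1
    return part  # number of pickle files written
```

Each part can later be loaded back with pickle.load, processed, and discarded before the next one is read.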



As far as I can tell, your approach of reading the file is already quite memory efficient.

I assume that getting a file object with open will not read the whole file into RAM, but will instead access it on the file system as needed.

You then iterate over the file object, which yields the file's lines (strings in your case, as you've opened the file in text mode); i.e., the file object acts as a generator. Thus one can assume that no list of all lines is constructed here and that the lines are read one by one and consumed continuously.

You do this in a list comprehension. Do list comprehensions collect all the values yielded by their right-hand side (the part after the in keyword) before passing them to their left-hand side (the part before the for keyword) for processing? A little experiment can tell us:

print('defining generator function')

def firstn(n):
    num = 0
    while num < n:
        print('yielding ' + str(num))
        yield num
        num += 1

print('--')

[print('consuming ' + str(i)) for i in firstn(5)]

The output of the above is

defining generator function
--
yielding 0
consuming 0
yielding 1
consuming 1
yielding 2
consuming 2
yielding 3
consuming 3
yielding 4
consuming 4

So the answer is no: each yielded value is immediately consumed by the left-hand side before the right-hand side yields the next one. Only one line from the file has to be kept in memory at a time.

So if the individual lines in your file aren't too long, your reading approach seems to be as memory efficient as it gets.

Of course, your list comprehension still has to collect the results of the left-hand side's processing. After all, the resulting list is what you want to get out of all this. So if you run out of memory, it is likely the resulting list that is becoming too large.
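If what you ultimately want is a NumPy array file on disk, one way to avoid building that list at all is to stream rows straight into a memory-mapped .npy file. The following is a sketch under two assumptions not stated in the question: all rows have the same length, and a cheap first pass over the file to count rows is acceptable (the function name and paths are placeholders).

```python
import numpy as np
from numpy.lib.format import open_memmap

def text_to_npy(src_path, dst_path):
    """Convert a comma-separated 0/1 text file to a .npy file without
    holding more than one parsed row in RAM."""
    # First pass: determine the row length and count the rows.
    with open(src_path) as f:
        first = f.readline()
        n_cols = len(first.split(','))
        n_rows = 1 + sum(1 for _ in f)
    # Second pass: stream each parsed line into a disk-backed array.
    out = open_memmap(dst_path, mode='w+', dtype=bool, shape=(n_rows, n_cols))
    with open(src_path) as f:
        for i, line in enumerate(f):
            out[i] = [int(tok) for tok in line.split(',')]
    out.flush()
    return n_rows, n_cols
```

The resulting file can later be opened with np.load(dst_path, mmap_mode='r'), again without pulling all 16 GB into memory.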

I don't know whether NumPy exploits the fact that collections of booleans can be stored more efficiently than numbers. But if it does, you'd have to make it aware that your integers are, in fact, boolean-valued to benefit from that more memory-efficient data type:

import numpy as np

with open("data.txt") as f:
    # bool is not a supported dtype for text parsing, so parse as uint8 and cast
    converted_data = [np.fromstring(line, dtype=np.uint8, sep=',').astype(bool)
                      for line in f]
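A quick check with the nbytes attribute settles the speculation above: a NumPy bool array uses one byte per element instead of the eight bytes of a default integer array on most 64-bit platforms, and np.packbits goes further still, packing eight values per byte (the sample row is taken from the question):

```python
import numpy as np

row = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0])   # default integer dtype
as_bool = row.astype(bool)                    # one byte per element
packed = np.packbits(as_bool)                 # one *bit* per element

print(row.nbytes, as_bool.nbytes, packed.nbytes)
```

Note that packed arrays trade convenience for space: np.unpackbits is needed to get the individual values back.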

If you don't need all of converted_data at once, but rather need to be able to iterate over it, consider making it a generator, too, instead of a list. You don't need to muck around with the yield keyword to achieve that: simply replace the square brackets of the list comprehension with parentheses and you've got a generator expression:

converted_data_generator = ( np.fromstring(line, dtype=bool, sep=',') for line in f )
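Consumed that way, each row is parsed only when the consumer asks for it. For instance, a running count of the ones in the whole file can be computed with only one parsed row in memory at a time (the function name is illustrative, not from the original post):

```python
import numpy as np

def count_ones(path):
    """Sum all the 1s in a comma-separated 0/1 text file, one row at a time."""
    with open(path) as f:
        rows = (np.array(line.strip().split(','), dtype=np.uint8).astype(bool)
                for line in f)
        return sum(int(row.sum()) for row in rows)
```

The generator expression and the sum both pull rows lazily, so the peak memory use is a single line of the file plus its parsed array.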

1 Comment

As is hopefully evident from my phrasing, a lot of my conclusions are based on speculation. So if anyone knows better, please do correct me.
