
How can one convert a huge text file (>16G) containing binary-valued characters (0 and 1) to a NumPy array file without blowing up the memory in Python? Assume we have enough storage on the machine, but not enough RAM for the conversion.

Sample data:

0,0,0,0,0,1,0,0,0 
1,0,0,1,0,0,0,0,0
...

Sample code:

with open("data.txt") as f:
    converted_data = [list(map(int, line.split(','))) for line in f]
  • What do you mean by "binary-valued strings"? What's the exact format of the existing file? You might be able to do something with a memory-mapped array, depending on the existing format. Commented Jun 17, 2015 at 22:42
  • This is a good topic but there's not enough information given to provide a useful answer. Can you post the code that would work to read a smaller file? If you can't do that, please provide more information about the format of the file, etc. Commented Jun 17, 2015 at 23:24
  • The test file contains "0" and "1" characters separated by commas. Commented Jun 18, 2015 at 15:45
  • Here is the sample code I used to do the conversion: Commented Jun 18, 2015 at 15:46
  • Are you interested in the memory consumption of the import process (which can be reduced by reading and converting the data serially instead of reading it all at once), or in the footprint of converted_data (which will be > 1G unless you can make use of sparse structures)? Commented Jun 18, 2015 at 20:57

2 Answers


You can create many binary files with pickle, along with some code that loads and unloads the different parts of your data as needed.

Say you have a file that is 16 GB: you can split it into 16 pickle files of 1 GB each.

Since you say you have enough storage, once the pickle files are written you can load them back one part at a time instead of holding everything in memory at once.
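A minimal sketch of that idea (the file names and chunk size are placeholders, not part of the question): stream the text file line by line, accumulate a fixed number of parsed rows, and pickle each chunk to its own file, so at most one chunk is ever in memory.

```python
import pickle

def split_into_pickles(src_path, chunk_lines=1_000_000):
    """Stream src_path and pickle every chunk_lines parsed rows to a file of its own."""
    chunk, part = [], 0
    with open(src_path) as f:
        for line in f:
            chunk.append([int(tok) for tok in line.split(',')])
            if len(chunk) == chunk_lines:
                with open(f"{src_path}.part{part}.pkl", "wb") as out:
                    pickle.dump(chunk, out)
                chunk = []
                part += 1
    if chunk:  # flush the final, possibly partial, chunk
        with open(f"{src_path}.part{part}.pkl", "wb") as out:
            pickle.dump(chunk, out)
        part += 1
    return part  # number of pickle files written
```

Each part can later be loaded back with pickle.load, processed, and discarded before the next one is read.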



As far as I can tell, your approach of reading the file is already quite memory efficient.

I assume that getting a file object with open will not read the whole file into RAM, but will instead access it on the file system as needed.

You then iterate over the file object, which yields the file's lines (strings in your case, as you've opened the file in text mode); i.e., the file object acts as a generator. Thus one can assume that no list of all lines is constructed here and that the lines are read one by one and consumed continuously.

You do this in a list comprehension. Do list comprehensions collect all the values yielded by their right-hand side (the part after the in keyword) before passing them to their left-hand side (the part before the for keyword) for processing? A little experiment can tell us:

print('defining generator function')

def firstn(n):
    num = 0
    while num < n:
        print('yielding ' + str(num))
        yield num
        num += 1

print('--')

[print('consuming ' + str(i)) for i in firstn(5)]

The output of the above is

defining generator function
--
yielding 0
consuming 0
yielding 1
consuming 1
yielding 2
consuming 2
yielding 3
consuming 3
yielding 4
consuming 4

So the answer is no: each yielded value is immediately consumed by the left-hand side before the right-hand side yields the next one. Only one line from the file has to be kept in memory at a time.

So if the individual lines in your file aren't too long, your reading approach seems to be as memory efficient as it gets.

Of course, your list comprehension still has to collect the results of the left-hand side's processing. After all, the resulting list is what you want to get out of all this. So if you run out of memory, it is likely the resulting list that is becoming too large.
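If what you ultimately want is a NumPy array file on disk, one way to avoid building that list at all is to stream rows straight into a memory-mapped .npy file. The following is a sketch under two assumptions not stated in the question: all rows have the same length, and a cheap first pass over the file to count rows is acceptable (the function name and paths are placeholders).

```python
import numpy as np
from numpy.lib.format import open_memmap

def text_to_npy(src_path, dst_path):
    """Convert a comma-separated 0/1 text file to a .npy file without
    holding more than one parsed row in RAM."""
    # First pass: determine the row length and count the rows.
    with open(src_path) as f:
        first = f.readline()
        n_cols = len(first.split(','))
        n_rows = 1 + sum(1 for _ in f)
    # Second pass: stream each parsed line into a disk-backed array.
    out = open_memmap(dst_path, mode='w+', dtype=bool, shape=(n_rows, n_cols))
    with open(src_path) as f:
        for i, line in enumerate(f):
            out[i] = [int(tok) for tok in line.split(',')]
    out.flush()
    return n_rows, n_cols
```

The resulting file can later be opened with np.load(dst_path, mmap_mode='r'), again without pulling all 16 GB into memory.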

I don't know whether NumPy exploits the fact that collections of booleans can be stored more efficiently than numbers. But if it does, you'd have to make it aware that your integers are, in fact, boolean-valued to benefit from that more memory-efficient data type:

import numpy as np

with open("data.txt") as f:
    # bool is not a supported dtype for text parsing, so parse as uint8 and cast
    converted_data = [np.fromstring(line, dtype=np.uint8, sep=',').astype(bool)
                      for line in f]
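A quick check with the nbytes attribute settles the speculation above: a NumPy bool array uses one byte per element instead of the eight bytes of a default integer array on most 64-bit platforms, and np.packbits goes further still, packing eight values per byte (the sample row is taken from the question):

```python
import numpy as np

row = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0])   # default integer dtype
as_bool = row.astype(bool)                    # one byte per element
packed = np.packbits(as_bool)                 # one *bit* per element

print(row.nbytes, as_bool.nbytes, packed.nbytes)
```

Note that packed arrays trade convenience for space: np.unpackbits is needed to get the individual values back.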

If you don't need all of converted_data at once, but rather need to be able to iterate over it, consider making it a generator, too, instead of a list. You don't need to muck around with the yield keyword to achieve that: simply replace the square brackets of the list comprehension with parentheses and you've got a generator expression:

converted_data_generator = ( np.fromstring(line, dtype=bool, sep=',') for line in f )
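Consumed that way, each row is parsed only when the consumer asks for it. For instance, a running count of the ones in the whole file can be computed with only one parsed row in memory at a time (the function name is illustrative, not from the original post):

```python
import numpy as np

def count_ones(path):
    """Sum all the 1s in a comma-separated 0/1 text file, one row at a time."""
    with open(path) as f:
        rows = (np.array(line.strip().split(','), dtype=np.uint8).astype(bool)
                for line in f)
        return sum(int(row.sum()) for row in rows)
```

The generator expression and the sum both pull rows lazily, so the peak memory use is a single line of the file plus its parsed array.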

1 Comment

As is hopefully evident from my phrasing, a lot of my conclusions are based on speculation. So if anyone knows better, please do correct me.
