I have a large dataset of texts and their corresponding labels. I used to read the CSV files with the csv module and then build NumPy arrays from that data, until I found out that keeping large text arrays in NumPy is memory inefficient.

import csv
import numpy as np

with open('sample.csv', 'r') as f:
    data = csv.reader(f.readlines())

texts = np.array([d[0] for d in data])

This takes about 13 GB of memory. But when pandas reads the very same data, it's as if nothing happened: not 50% less memory usage, not even 20% less, it takes just 300 MB.

import pandas as pd

data = pd.read_csv('sample.csv')

texts2 = np.array(data['text'])

The only difference between the texts and texts2 arrays is the dtype:

>>> texts.dtype
dtype('<U92569')

>>> texts2.dtype
dtype('O')
  • That '<U92569' dtype means the NumPy array allocates 92569 characters (4 bytes each) to each element (which I suspect is one element per line); 92569 is the length of the longest string. With texts.shape we could calculate the memory usage (see the snippet after these comments). Did you look at the elements of texts? Commented Apr 26, 2020 at 16:49
  • @hpaulj The text elements are comments of variable size. Commented Apr 27, 2020 at 10:11
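
For reference, a quick way to do that calculation (a sketch, assuming the texts and texts2 arrays built above):

# Fixed-width dtype: every element reserves room for the longest string.
texts.dtype.itemsize   # 92569 chars * 4 bytes = 370,276 bytes per element
texts.nbytes           # rows * 370,276 bytes -> ~13 GB for roughly 35,000 rows

# Object dtype: the buffer holds only 8-byte pointers (on 64-bit builds);
# the strings themselves live outside the array as ordinary Python objects.
texts2.nbytes          # rows * 8 bytes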

1 Answer

Your first array is using a NumPy string dtype. Those are fixed-width, so every element of the array takes as much space as the longest string in the array, and one of the strings is 92569 characters long, driving up the space requirements for the shorter strings.

Your second array is using object dtype. That just holds references to a bunch of regular Python objects, so each element is a regular Python string object. There's additional per-element object overhead, but each string only needs enough room to hold its own data, instead of enough room to hold the biggest string in the array.

Also, NumPy unicode dtypes always use 4 bytes per character, while Python string objects use less if the string doesn't contain any high code points.
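
Here's a small self-contained sketch of both effects, with illustrative sizes rather than the asker's actual data (the strings list below is hypothetical):

import sys
import numpy as np

# One long string among many short ones, mimicking the question's situation.
strings = ['x' * 10_000] + ['short comment'] * 99

fixed = np.array(strings)                 # fixed-width unicode dtype
boxed = np.array(strings, dtype=object)   # pointers to ordinary Python str objects

print(fixed.dtype)    # dtype('<U10000'): sized for the longest string
print(fixed.nbytes)   # 100 * 10000 * 4 = 4,000,000 bytes
print(boxed.nbytes)   # 100 * 8 = 800 bytes of pointers (on 64-bit CPython)
print(sum(sys.getsizeof(s) for s in strings))   # ~16 KB of actual string storage

Note that boxed.nbytes counts only the pointer buffer; but even after adding sys.getsizeof for every string, the object route stays far below the padded fixed-width buffer here.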


1 Comment

I ran a memory profile on a NumPy string array and its corresponding object column in a DataFrame, and I could see exactly what your answer describes. Interestingly, when I manipulated the column so that all the strings were the same length, the NumPy string array took much less memory than the pandas column. It seems NumPy gets heavily optimized for memory when all the strings are the same length.
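
That matches the arithmetic: with uniform lengths the fixed-width layout pays no padding, while each element of the object column still carries per-string overhead (an 8-byte pointer plus a Python str header, roughly 49 bytes for ASCII text). A rough sketch with hypothetical data, assuming 64-bit CPython:

import sys
import numpy as np

same = [f'{i:010d}' for i in range(100_000)]   # 100k distinct 10-character strings

fixed = np.array(same)                # dtype('<U10'): 10 chars * 4 bytes = 40 bytes each
boxed = np.array(same, dtype=object)  # one 8-byte pointer per element

print(fixed.nbytes)                                        # 100,000 * 40 = 4,000,000 bytes
print(boxed.nbytes + sum(sys.getsizeof(s) for s in same))  # pointers + str objects, ~6.7 MB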
