I have a large dataset of texts and their corresponding labels. I used to read the CSV files with the csv module and then build NumPy arrays from that data, until I found out that keeping large text arrays in NumPy is memory inefficient.

import csv
import numpy as np

with open('sample.csv', 'r') as f:
    data = csv.reader(f.readlines())

texts = np.array([d[0] for d in data])

This takes about 13 GB of memory. But when pandas reads the very same data, it's as if nothing happened: not 50% less memory usage, not even 20% less, it takes just 300 MB.

import pandas as pd

data = pd.read_csv('sample.csv')

texts2 = np.array(data['text'])

The only difference between the texts and texts2 arrays is the dtype:

>>> texts.dtype
dtype('<U92569')

>>> texts2.dtype
dtype('O')
  • That '<U92569' dtype means the NumPy array allocates 92569 characters (4 bytes each) to each element (which I suspect is one element per line); 92569 is the length of the longest string. With texts.shape we could calculate the memory usage (see the snippet after these comments). Did you look at the elements of texts? Commented Apr 26, 2020 at 16:49
  • @hpaulj The text elements are comments of variable size. Commented Apr 27, 2020 at 10:11
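
For reference, a quick way to do that calculation (a sketch, assuming the texts and texts2 arrays built above):

# Fixed-width dtype: every element reserves room for the longest string.
texts.dtype.itemsize   # 92569 chars * 4 bytes = 370,276 bytes per element
texts.nbytes           # rows * 370,276 bytes -> ~13 GB for roughly 35,000 rows

# Object dtype: the buffer holds only 8-byte pointers (on 64-bit builds);
# the strings themselves live outside the array as ordinary Python objects.
texts2.nbytes          # rows * 8 bytes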

1 Answer

Your first array is using a NumPy string dtype. Those are fixed-width, so every element of the array takes as much space as the longest string in the array, and one of the strings is 92569 characters long, driving up the space requirements for the shorter strings.

Your second array is using object dtype. That just holds references to a bunch of regular Python objects, so each element is a regular Python string object. There's additional per-element object overhead, but each string only needs enough room to hold its own data, instead of enough room to hold the biggest string in the array.

Also, NumPy unicode dtypes always use 4 bytes per character, while Python string objects use less if the string doesn't contain any high code points.
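
Here's a small self-contained sketch of both effects, with illustrative sizes rather than the asker's actual data (the strings list below is hypothetical):

import sys
import numpy as np

# One long string among many short ones, mimicking the question's situation.
strings = ['x' * 10_000] + ['short comment'] * 99

fixed = np.array(strings)                 # fixed-width unicode dtype
boxed = np.array(strings, dtype=object)   # pointers to ordinary Python str objects

print(fixed.dtype)    # dtype('<U10000'): sized for the longest string
print(fixed.nbytes)   # 100 * 10000 * 4 = 4,000,000 bytes
print(boxed.nbytes)   # 100 * 8 = 800 bytes of pointers (on 64-bit CPython)
print(sum(sys.getsizeof(s) for s in strings))   # ~16 KB of actual string storage

Note that boxed.nbytes counts only the pointer buffer; but even after adding sys.getsizeof for every string, the object route stays far below the padded fixed-width buffer here.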


1 Comment

I ran a memory profile on a NumPy string array and its corresponding object column in a DataFrame, and I could see exactly what your answer describes. Interestingly, when I manipulated the column so that all the strings were the same length, the NumPy string array took much less memory than the pandas column. It seems NumPy gets heavily optimized for memory when all the strings are the same length.
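
That matches the arithmetic: with uniform lengths the fixed-width layout pays no padding, while each element of the object column still carries per-string overhead (an 8-byte pointer plus a Python str header, roughly 49 bytes for ASCII text). A rough sketch with hypothetical data, assuming 64-bit CPython:

import sys
import numpy as np

same = [f'{i:010d}' for i in range(100_000)]   # 100k distinct 10-character strings

fixed = np.array(same)                # dtype('<U10'): 10 chars * 4 bytes = 40 bytes each
boxed = np.array(same, dtype=object)  # one 8-byte pointer per element

print(fixed.nbytes)                                        # 100,000 * 40 = 4,000,000 bytes
print(boxed.nbytes + sum(sys.getsizeof(s) for s in same))  # pointers + str objects, ~6.7 MB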
