I have a large dataset of texts and their corresponding labels. I used to read CSV files with the csv module and then build numpy arrays from that data, until I found out that holding large text arrays in numpy is memory inefficient.
import csv
import numpy as np

with open('sample.csv', 'r') as f:
    data = csv.reader(f.readlines())
    texts = np.array([d[0] for d in data])
This takes about 13 GB of memory. But when pandas reads the very same data, it's as if nothing happened: hardly any data seems to be held in memory. And by that I mean it's not 50% less memory usage, or even 20% less; it takes just 300 MB.
import pandas as pd

data = pd.read_csv('sample.csv')
texts2 = np.array(data['text'])
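Here is a minimal sketch of how the two numbers could be checked, continuing from the snippets above (texts from the csv-module version, data from read_csv); deep=True makes pandas count the Python string objects themselves rather than just the pointers:

# Size of the buffer of the numpy array built by the csv-module version.
print(texts.nbytes / 1e9, 'GB')

# Size of the pandas column, including the Python str objects it points to.
print(data['text'].memory_usage(deep=True) / 1e6, 'MB')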
The only difference between the texts and texts2 arrays is the dtype:
>>> texts.dtype
dtype('<U92569')
>>> texts2.dtype
dtype('O')
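That dtype is the whole difference: numpy 'U' strings are fixed-width UCS-4, so every element reserves 4 bytes per character for the longest string in the array, while an object array only stores pointers to Python str objects. A tiny toy example (unrelated to sample.csv) shows the effect:

import numpy as np

a = np.array(['hi', 'a much longer string'])
print(a.dtype)    # <U20: every element is padded to 20 code points
print(a.nbytes)   # 2 * 20 * 4 = 160 bytes, even though 'hi' has 2 characters

b = np.array(['hi', 'a much longer string'], dtype=object)
print(b.nbytes)   # 16 bytes: just two 8-byte pointers to Python str objects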
With texts.shape we could calculate the memory usage. That 92569 is the length of the longest string. Did you look at the elements of texts?
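A rough sketch of that calculation: every element of a '<U92569' array occupies 92569 * 4 bytes no matter how short the string is, so the 13 GB figure would be consistent with roughly 35,000 rows (the actual row count is not given above, so the number below is only illustrative):

bytes_per_element = 92569 * 4            # fixed-width UCS-4: 370,276 bytes per row
n_rows = 35_000                          # hypothetical row count, not stated above
print(n_rows * bytes_per_element / 1e9)  # ~13 GB

The object array, by contrast, only adds an 8-byte pointer per row on top of the Python str objects that pandas already holds, which is why texts2 barely registers.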