I am working on converting an existing pandas DataFrame to a NumPy array. The DataFrame has no NaN values and is not sparsely populated (it was read in from a .csv file). To check memory usage, I ran the following:
sum(df.memory_usage())
2400128
sys.getsizeof(df)
2400144
The small 16-byte difference above is negligible and understood: sys.getsizeof and summing df.memory_usage() calculate size differently and account for object overhead differently (for reference, df.info() or the pandas_profiling library can also be used to inspect memory usage).
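For reproducibility, here is a self-contained stand-in for my setup. The shape is inferred from the byte counts above: 2,400,000 bytes of data at 8 bytes per value works out to 12,000 rows by 25 columns, plus 128 bytes for the RangeIndex.

import sys
import numpy as np
import pandas as pd

# Synthetic stand-in for the .csv data: 12,000 rows x 25 float64 columns.
df = pd.DataFrame(np.random.rand(12000, 25))

df.memory_usage().sum()  # 2400128 (2,400,000 bytes of data + 128-byte RangeIndex)
sys.getsizeof(df)        # ~2400144 (same data plus a little object overhead)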
Now, when converting this to a NumPy array, there seems to be a huge discrepancy in memory usage:
sys.getsizeof(np.array(df))
2400120
sys.getsizeof(df.to_numpy())
120
To me, this does not make any sense, since both arrays have the same dtype, the same shape, and the same data:
np.array(df)
array([[1.0000e+00, 2.0000e+04, 2.0000e+00, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00],
       [2.0000e+00, 1.2000e+05, 2.0000e+00, ..., 0.0000e+00, 2.0000e+03,
        1.0000e+00],
       [3.0000e+00, 9.0000e+04, 2.0000e+00, ..., 1.0000e+03, 5.0000e+03,
        0.0000e+00],
       ...,
       [1.1998e+04, 9.0000e+04, 1.0000e+00, ..., 3.0000e+03, 4.0000e+03,
        0.0000e+00],
       [1.1999e+04, 2.8000e+05, 1.0000e+00, ..., 3.5000e+02, 2.0950e+03,
        0.0000e+00],
       [1.2000e+04, 2.0000e+04, 1.0000e+00, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00]])
df.to_numpy() # or similarly, np.asarray(df)
array([[1.0000e+00, 2.0000e+04, 2.0000e+00, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00],
       [2.0000e+00, 1.2000e+05, 2.0000e+00, ..., 0.0000e+00, 2.0000e+03,
        1.0000e+00],
       [3.0000e+00, 9.0000e+04, 2.0000e+00, ..., 1.0000e+03, 5.0000e+03,
        0.0000e+00],
       ...,
       [1.1998e+04, 9.0000e+04, 1.0000e+00, ..., 3.0000e+03, 4.0000e+03,
        0.0000e+00],
       [1.1999e+04, 2.8000e+05, 1.0000e+00, ..., 3.5000e+02, 2.0950e+03,
        0.0000e+00],
       [1.2000e+04, 2.0000e+04, 1.0000e+00, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00]])
I found that df.to_numpy() uses np.asarray internally to perform the conversion, so I tried that as well:
sys.getsizeof(np.asarray(df))
120
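One more data point: asking np.array not to copy reproduces the small number, which makes me suspect copying is somehow involved:

sys.getsizeof(np.array(df, copy=False))
120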
Both np.asarray(df) and df.to_numpy() report a total of 120 bytes, while np.array(df) reports 2400120 bytes! This does not make any sense!
Neither array is stored as a sparse array, and as shown above, both print exactly the same output (and checking their dtypes and shapes confirms they match).
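By "checking" I mean comparisons along these lines:

a = np.array(df)
b = df.to_numpy()

a.shape == b.shape    # True
a.dtype == b.dtype    # True (both float64)
np.array_equal(a, b)  # True: element-for-element identical
a.nbytes == b.nbytes  # True: both report 2,400,000 bytes of element data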
I'm trying to understand this huge discrepancy in memory usage: all the values in the .csv file are integers or floats, and no missing or NaN values are present, so this does not seem to make sense from a memory perspective. Perhaps np.asarray(df) (and hence df.to_numpy()) is doing something different from np.array(df), or sys.getsizeof is measuring something different, but I cannot pinpoint the cause.
sys.getsizeof is not a useful measure of memory usage unless you thoroughly understand how the object is organized and which of its values are stored by reference.
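In this particular case, the detail that matters is ownership. np.array(df) copies the data into a new array that owns its buffer, so sys.getsizeof counts the header plus the 2,400,000-byte buffer. df.to_numpy() and np.asarray(df) typically avoid the copy for a single-dtype frame and return a view whose data buffer belongs to another object (exposed as the array's base), so sys.getsizeof counts only the ~120-byte ndarray header. A sketch of how to verify this with the df from the question:

a = np.array(df)    # copy=True by default: a owns a fresh buffer
b = df.to_numpy()   # typically a view into the frame's internal block

a.flags['OWNDATA']  # True  -> getsizeof includes the data buffer
b.flags['OWNDATA']  # False -> getsizeof counts only the ndarray header
b.base is not None  # True: the data buffer belongs to another object

# To measure the element buffer itself, ask the array directly:
a.nbytes            # 2400000
b.nbytes            # 2400000 (same data; b just doesn't own it)

This is also why, in a case like this one, writing into the to_numpy() result can modify the DataFrame, while writing into the np.array(df) copy cannot.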