
I am working on converting an existing pandas DataFrame to a NumPy array. The DataFrame has no NaN values and is not sparsely populated (it is read in from a .csv file). In addition, to check memory usage, I ran the following:

sum(df.memory_usage())

2400128

sys.getsizeof(df)

2400144

The small 16-byte difference above is negligible and understood: sys.getsizeof calls the object's __sizeof__ method and adds garbage-collector overhead on top, so it reports slightly more than summing df.memory_usage() (for reference, df.info() and the pandas_profiling library report similar figures).
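For a reproducible point of reference, the DataFrame can be approximated as follows. This is a sketch under assumptions: the (12000, 25) float64 shape is inferred from the totals above (12000 * 25 * 8 = 2400000 data bytes), the extra 128 bytes come from the RangeIndex, and the extra 16 bytes in sys.getsizeof are the garbage-collector header it adds on top of __sizeof__:

import sys
import numpy as np
import pandas as pd

# Hypothetical stand-in for the .csv data; the shape is an assumption, not the real file
df = pd.DataFrame(np.random.rand(12000, 25))

sum(df.memory_usage())  # 2400128: 2400000 bytes of float64 data + 128 for the RangeIndex
sys.getsizeof(df)       # 2400144: the __sizeof__ result plus 16 bytes of GC overhead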

Now, when converting this to a NumPy array, there seems to be a huge discrepancy in memory usage:

sys.getsizeof(np.array(df))

2400120

sys.getsizeof(df.to_numpy())

120

To me, this does not make sense, since both arrays have the same type, size, and data:

np.array(df)

array([[1.0000e+00, 2.0000e+04, 2.0000e+00, ..., 0.0000e+00, 0.0000e+00,
    1.0000e+00],
   [2.0000e+00, 1.2000e+05, 2.0000e+00, ..., 0.0000e+00, 2.0000e+03,
    1.0000e+00],
   [3.0000e+00, 9.0000e+04, 2.0000e+00, ..., 1.0000e+03, 5.0000e+03,
    0.0000e+00],
   ...,
   [1.1998e+04, 9.0000e+04, 1.0000e+00, ..., 3.0000e+03, 4.0000e+03,
    0.0000e+00],
   [1.1999e+04, 2.8000e+05, 1.0000e+00, ..., 3.5000e+02, 2.0950e+03,
    0.0000e+00],
   [1.2000e+04, 2.0000e+04, 1.0000e+00, ..., 0.0000e+00, 0.0000e+00,
    1.0000e+00]])

df.to_numpy() # or similarly, np.asarray(df)

array([[1.0000e+00, 2.0000e+04, 2.0000e+00, ..., 0.0000e+00, 0.0000e+00,
    1.0000e+00],
   [2.0000e+00, 1.2000e+05, 2.0000e+00, ..., 0.0000e+00, 2.0000e+03,
    1.0000e+00],
   [3.0000e+00, 9.0000e+04, 2.0000e+00, ..., 1.0000e+03, 5.0000e+03,
    0.0000e+00],
   ...,
   [1.1998e+04, 9.0000e+04, 1.0000e+00, ..., 3.0000e+03, 4.0000e+03,
    0.0000e+00],
   [1.1999e+04, 2.8000e+05, 1.0000e+00, ..., 3.5000e+02, 2.0950e+03,
    0.0000e+00],
   [1.2000e+04, 2.0000e+04, 1.0000e+00, ..., 0.0000e+00, 0.0000e+00,
    1.0000e+00]])

I found out that df.to_numpy() uses np.asarray to perform the conversion, so I also tried this:

sys.getsizeof(np.asarray(df))

120

Both np.asarray(df) and df.to_numpy() give a total of 120 bytes, while np.array(df) gives 2400120 bytes! This does not make any sense!

Neither array is stored as a sparse array, and as shown above, both have exactly the same output (and checking the dtypes confirms they are the same).

I have no idea how to resolve this, since it doesn't seem to make sense from a memory perspective. I'm trying to understand this huge discrepancy in memory usage: all the values in the .csv file are integers or floats, and there are no missing or NaN values. Perhaps np.asarray(df) (and hence df.to_numpy()) is doing something differently from np.array(df), or sys.getsizeof is doing something odd, but I cannot work out what.
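For what it's worth, here is a quick diagnostic one can run (a sketch; whether to_numpy() returns a view depends on the pandas version and on all columns sharing a dtype, but it matches what I observe here):

nparr_copy = np.array(df)    # documented to copy by default
nparr_view = df.to_numpy()   # can return a view when all columns share one dtype

nparr_copy.flags.owndata     # True: this array owns its buffer
nparr_view.flags.owndata     # False: the buffer belongs to another object
nparr_view.base is not None  # True: .base points at the object holding the data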

Comment: As a general rule, sys.getsizeof is not a useful measure of memory usage unless you thoroughly understand how the object is organized and which values are stored by reference.

2 Answers


A NumPy array has attributes like shape and dtype, and a data buffer, a flat C array that stores the values.

arr.nbytes    # 2400000

is telling you the size of that data buffer. So if the array is (300, 1000) with float64 dtype, that is 300*1000*8 = 2400000 bytes.

The getsizeof value of 2400120 reports that buffer plus roughly 120 bytes used by the array object itself: the shape tuple, dtype, flags, etc.

But an array may be a view of another. It has its own ~120 bytes of overhead but references the data buffer of the other array. getsizeof reports only that 120, not the shared memory; in effect it tells us how much extra memory the view consumes.
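A minimal sketch of this, independent of pandas (the exact per-object overhead varies slightly with platform and NumPy version, so treat 120 as approximate):

import sys
import numpy as np

base = np.zeros((300, 1000))   # owns a 300*1000*8 = 2400000-byte buffer
view = base[:]                 # a new array object sharing that same buffer

sys.getsizeof(base)  # ~2400120: buffer plus per-object overhead
sys.getsizeof(view)  # ~120: per-object overhead only; the shared buffer is not counted
view.base is base    # True: the view references base's data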

A DataFrame is a complex object with index arrays, a column-name list (or array), etc. How the data is stored depends on the column dtypes. Columns may be viewed as Series, or grouped into blocks of like dtype. I think in your case all columns have the same dtype, so the data is stored in one 2d NumPy array, and it is the data buffer of that array that the DataFrame's getsizeof is reporting.

df.values
df.to_numpy()

both return a view of that data array, so getsizeof reports only the 120 bytes of overhead.

np.array(df) returns a copy of that array, which has its own data buffer, and thus the full size. Read its docs.

np.asarray(df) has a copy=False parameter, and thus returns a view if possible.
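The copy semantics are easy to confirm directly. A small sketch: when the input is already an ndarray of the requested dtype, np.asarray returns it unchanged, while np.array makes a fresh copy:

import numpy as np

a = np.arange(10)
np.array(a) is a    # False: np.array copies by default (copy=True)
np.asarray(a) is a  # True: asarray avoids the copy when it can

For a homogeneous DataFrame the same logic means np.asarray(df) and df.to_numpy() can both hand back views of one underlying block, which np.shares_memory(np.asarray(df), df.to_numpy()) would confirm (pandas-version dependent).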

In sum, the concept of a view is key to understanding the differences you see. sys.getsizeof is not that useful a measure unless you already understand how objects are organized. It's a good idea to check the documentation of the functions you use, including np.array, np.asarray, and .to_numpy.



After digging around a bit more, I think I found an answer to this question (although in my opinion it isn't a satisfactory one).

First, I stored each result in its own variable:

nparr1 = np.array(df)
nparr2 = df.to_numpy()

Next, I compared the type and dtype of each array, which showed no difference in how they are formatted or stored. This was quite baffling, until I came across the NumPy array attributes .itemsize and .size, and multiplied them for each array:

nparr1.itemsize * nparr1.size

2400000

nparr2.itemsize * nparr2.size

2400000

How odd! Now the values match up. This can also be checked with nbytes, which yields the same values as above:

nparr1.nbytes

2400000

nparr2.nbytes

2400000
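Note that nbytes describes the size of the data buffer an array addresses, whether or not the array owns that buffer, so it is the same for a copy and a view. A small illustration, independent of the DataFrame:

import numpy as np

base = np.zeros((12000, 25))
view = base[:]

base.nbytes  # 2400000
view.nbytes  # 2400000: same buffer described; ownership plays no role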

So there is no magical compression algorithm; the memory is all there. It just seems that sys.getsizeof behaves oddly for some reason (still unresolved at this point). However, notice the difference from the question above:

sys.getsizeof(np.array(df))

2400120

sys.getsizeof(df.to_numpy())

120

Now, oddly enough, nparr1.nbytes yielded 2400000, and 2400120 - 2400000 = 120. So it seems that sys.getsizeof(df.to_numpy()) reports only the overhead cost (the array object itself, which references memory held elsewhere), while sys.getsizeof(np.array(df)) reports the full 2400000-byte payload plus the 120 bytes of overhead, giving 2400120. I hope this analysis is correct; if anyone has further insight into what is actually going on behind the scenes of df.to_numpy()/np.asarray(df) vs. np.array(df), how the data is stored in memory, and the odd behavior of sys.getsizeof, I would be happy to learn more.
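If this reading is right, the decomposition can be checked directly on the variables from above; np.shares_memory is one way to confirm that df.to_numpy() handed back a view here (a sketch, assuming the same df and the pandas behavior observed in the question):

sys.getsizeof(nparr1) - nparr1.nbytes  # 120: per-object overhead of the owning copy
sys.getsizeof(nparr2)                  # 120: a view carries only that overhead
np.shares_memory(nparr2, df.values)    # True: nparr2 references the DataFrame's buffer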

I have checked this with a colleague, and this is what we have settled on (my colleague also independently tested this on the same .csv file and arrived at the same conclusion). However, it is not a satisfactory answer as to what is truly happening behind the scenes, nor as to this unexpected behavior for what appears to be the exact same NumPy array.

Comment: In your reading about NumPy, have you come across the difference between a view and a copy, or the basics of how a NumPy array is stored? Have you also read the docs for np.array and np.asarray? What is different?
