The traditional way to save a NumPy object to Parquet is to use Pandas as an intermediary. However, I am working with a lot of data that doesn't fit in Pandas without crashing my environment, because in Pandas the data takes up much more RAM.
I need to save to Parquet because I am working with variable-length arrays in NumPy, and for that kind of data Parquet actually takes up less disk space than .npy or .hdf5.
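For reference, the traditional Pandas-intermediate route looks something like this (a minimal sketch with made-up variable-length data; the toy arrays and the file name are placeholders, and the to_parquet call mirrors the one used below):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: ragged numpy arrays
batches = [np.arange(n) for n in (5, 17, 100)]

# Wrap in a DataFrame, then hand off to the Parquet writer
df = pd.DataFrame({'totalCites2': batches})
df.to_parquet('toy.pqt', compression='brotli', index=False)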
The following code is a minimal example that downloads a small chunk of my data, converts between Pandas and NumPy objects to measure how much RAM they consume, and saves them to .npy and Parquet files to see how much disk space they take.
# Download sample file, about 10 MB
from sys import getsizeof
import requests
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    # Google Drive asks for confirmation on larger files via a cookie
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    # Stream the download to disk in 32 KB chunks
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')
sampleDF = pd.read_pickle('sample.pkl')
sampleDF.to_parquet('test1.pqt', compression='brotli', index=False)
# Parquet file takes up little space
os.path.getsize('test1.pqt')
6594712
getsizeof(sampleDF)
22827172
sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))
# RAM is reduced if the variable-length batches are numpy arrays
getsizeof(sampleDF)
22401764
# Much less RAM as a numpy object
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)
112
# Much more space in .npy form
np.save('test2.npy', sampleNumpy)
os.path.getsize('test2.npy')
20825382
# numpy savez_compressed: smaller than .npy, but still not as good as Parquet
np.savez_compressed('test3.npy', sampleNumpy)  # note: numpy appends '.npz' to the name
os.path.getsize('test3.npy.npz')
9873964
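Pandas' to_parquet is itself a thin wrapper around a Parquet engine such as pyarrow, so the write side can in principle skip the DataFrame. A minimal sketch, assuming pyarrow is installed and that the ragged column converts cleanly to an Arrow list type (the toy data and test4.pqt are placeholders):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical ragged data, as above
batches = [np.arange(n) for n in (5, 17, 100)]

# Build an Arrow list column directly from the variable-length arrays
col = pa.array([b.tolist() for b in batches], type=pa.list_(pa.int64()))
table = pa.table({'totalCites2': col})
pq.write_table(table, 'test4.pqt', compression='brotli')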
sys.getsizeof is not a good measure of memory use. For an ndarray there's nbytes, but that's just the number of elements times the size of each element (typically 4-8 bytes). A DataFrame might store its data in a similar sized array. But if you have arrays of arrays or lists (object dtype) then you have to take into account the size of those objects. There's no one number or measure; you have to understand how the data object is structured (pandas.SparseArray and others).

I pickle.load'ed the sample.pkl file. The result was a DataFrame. In other words, given the source, you can't bypass pandas. That's the version with lists in the second column. Your apply command converts those to arrays, though with lengths of 5 to 100 that doesn't seem to make much difference. It's an object dtype column.
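To make that concrete, here is a small sketch (with made-up data) of why getsizeof and nbytes undercount an object-dtype column, and of the deep accounting pandas offers via memory_usage(deep=True):

import numpy as np
import pandas as pd
from sys import getsizeof

# Hypothetical object-dtype array of variable-length arrays
arr = np.empty(3, dtype=object)
arr[:] = [np.arange(n) for n in (5, 50, 100)]

getsizeof(arr)  # counts only the array header and its pointer buffer
arr.nbytes      # 3 pointers x 8 bytes = 24; the nested arrays are ignored

# Walk the nested objects to account for their storage too
arr.nbytes + sum(a.nbytes for a in arr)

# pandas can do the same deep accounting for a DataFrame column
df = pd.DataFrame({'totalCites2': list(arr)})
df.memory_usage(deep=True).sum()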