
The traditional way to save a numpy object to Parquet is to use Pandas as an intermediate. However, I am working with a lot of data, which doesn't fit in a Pandas DataFrame without crashing my environment, because in Pandas the data takes up a lot of RAM.
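
For reference, a minimal sketch of that traditional route; the array contents, column name, and file name here are just placeholders:

import numpy as np
import pandas as pd

# Placeholder data: a plain 1-D numpy array
arr = np.arange(1_000_000, dtype=np.float32)

# Traditional route: wrap the array in a DataFrame, then write Parquet
df = pd.DataFrame({"values": arr})
df.to_parquet("values.pqt", compression="brotli", index=False)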

I need to save to Parquet because I am working with variable-length arrays in numpy, and for that kind of data Parquet actually takes less disk space than .npy or .hdf5.

The following code is a minimal example that downloads a small chunk of my data, converts between pandas and numpy objects to measure how much RAM they consume, and saves them to .npy and Parquet files to see how much disk space they take.

# Download sample file, about 10 MB

from sys import getsizeof
import requests
import pickle
import numpy as np
import pandas as pd
import os

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-0R28Yhdrq2QWQ-4MXHIZUdZG2WZK2qR', 'sample.pkl')

sampleDF = pd.read_pickle('sample.pkl')

sampleDF.to_parquet( 'test1.pqt', compression = 'brotli', index = False )

# Parquet file takes up little space 
os.path.getsize('test1.pqt')

6594712

getsizeof(sampleDF)

22827172

sampleDF['totalCites2'] = sampleDF['totalCites2'].apply(lambda x: np.array(x))

# RAM reduced if the variable-length batches are in numpy
getsizeof(sampleDF)

22401764

# Much less RAM as a numpy object
sampleNumpy = sampleDF.values
getsizeof(sampleNumpy)

112

# Much more space in .npy form 
np.save( 'test2.npy', sampleNumpy) 
os.path.getsize('test2.npy')

20825382

# Numpy savez. Not as good as parquet 
np.savez_compressed( 'test3.npy', sampleNumpy )
os.path.getsize('test3.npy.npz')

9873964

  • That 112 number is meaningless. In general sys.getsizeof is not a good measure of memory use. Commented Aug 27, 2019 at 23:40
  • What would be a better way to measure the memory use? Commented Aug 28, 2019 at 0:13
  • For ndarray, nbytes. That's just the number of elements times the size of each element (typically 4-8 bytes). A DataFrame might store its data in a similar-sized array. But if you have arrays of arrays or lists (object dtype) then you have to take into account the size of those objects. There's no one number or measure; you have to understand how the data object is structured (a short sketch follows these comments). Commented Aug 28, 2019 at 0:18
  • 1
    If there are many repeated values in columns then pandas sparse data structures may help - see pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html for documentation of pandas.SparseArray and others Commented Aug 28, 2019 at 0:35
  • 1
    Using your notebook I pickle.load the sample.pkl file. The result was a DataFrame. In other words, given the source, you can't bypass pandas. That's the version with lists in the second column. Your apply command converts those to arrays, though with lengths of 5 to100 that doesn't seem to make much difference. It's an object dtype column. Commented Aug 28, 2019 at 3:56
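
Following those comments, a minimal sketch of measuring memory more meaningfully than sys.getsizeof; the shapes and column names below are illustrative, not the real data:

from sys import getsizeof

import numpy as np
import pandas as pd

# A plain ndarray: nbytes counts the actual element buffer
dense = np.zeros((1000, 5), dtype=np.float64)
print(dense.nbytes)                    # 40000 bytes: 5000 elements * 8 bytes each

# A view does not own its buffer, so getsizeof reports only the small
# array header -- this is why sampleDF.values appeared to be 112 bytes
view = dense[:]
print(getsizeof(view))                 # roughly 100 bytes, regardless of data size

# For a DataFrame, memory_usage(deep=True) also follows object columns,
# e.g. a column holding variable-length arrays like totalCites2
df = pd.DataFrame({
    "id": np.arange(3),
    "cites": [np.arange(n) for n in (5, 50, 100)],  # object dtype column
})
print(df.memory_usage(deep=True).sum())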

2 Answers


You can read/write numpy arrays to Parquet directly using Apache Arrow (pyarrow), which is also the underlying Parquet backend in pandas. Note that Parquet is a tabular format, so creating a table is still necessary.

import numpy as np
import pyarrow as pa
import pyarrow.parquet  # the parquet submodule must be imported explicitly

np_arr = np.array([1.3, 4.22, -5], dtype=np.float32)
pa_table = pa.table({"data": np_arr})
pa.parquet.write_table(pa_table, "test.parquet")

refs: numpy to pyarrow, pyarrow.parquet.write_table
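
Reading the array back is the symmetric operation; a minimal sketch that reuses the test.parquet file and "data" column from the example above:

import pyarrow.parquet as pq

# Read the Parquet file back into an Arrow table,
# then convert the column back to a numpy array
table = pq.read_table("test.parquet")
restored = table["data"].to_numpy()  # float32 array with the original values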


The Parquet format can be written using pyarrow; the correct import syntax is:

import pyarrow.parquet as pq, so you can use pq.write_table. Otherwise, using import pyarrow as pa, pa.parquet.write_table will return: AttributeError: module 'pyarrow' has no attribute 'parquet'.

Pyarrow requires the data to be organized column-wise, which means that for a multidimensional numpy array you need to assign each column of the array to its own field in the Parquet table.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


ndarray = np.array(
    [
        [4.96266477e05, 4.55342071e06, -1.03240000e02, -3.70000000e01, 2.15592864e01],
        [4.96258372e05, 4.55344875e06, -1.03400000e02, -3.85000000e01, 2.40120775e01],
        [4.96249387e05, 4.55347732e06, -1.03330000e02, -3.47500000e01, 2.70718535e01],
    ]
)

ndarray_table = pa.table(
    {
        "X": ndarray[:, 0],
        "Y": ndarray[:, 1],
        "Z": ndarray[:, 2],
        "Amp": ndarray[:, 3],
        "Ang": ndarray[:, 4],
    }
)

pq.write_table(ndarray_table, "ndarray.parquet")
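
Reading it back follows the same pattern; a minimal sketch that reuses the column names from the table above and stacks them back into a 2-D array:

import numpy as np
import pyarrow.parquet as pq

# Read the table written above and rebuild the original 2-D ndarray
table = pq.read_table("ndarray.parquet")
restored = np.column_stack(
    [table[name].to_numpy() for name in ("X", "Y", "Z", "Amp", "Ang")]
)
print(restored.shape)  # (3, 5)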


  • Or you can just use the import pyarrow.parquet and import pyarrow as pa combo.
