
I have a large number of NumPy arrays stored in S3 as .npz archives. What is the best way to load them into a PySpark RDD/DataFrame of NumPy arrays? I have tried to load the files using the sc.wholeTextFiles API.

rdd = sc.wholeTextFiles("s3://[bucket]/[folder_containing_npz_files]")

However, numpy.load requires a file handle, and loading the file contents into memory as a string takes up a lot of memory.

1 Answer


You cannot do much about the memory requirements, but otherwise BytesIO should work just fine:

from io import BytesIO

import numpy as np

def extract(kv):
    # kv is a (path, bytes) pair produced by sc.binaryFiles
    k, v = kv
    with BytesIO(v) as r:
        # np.load accepts any file-like object, so wrap the raw bytes
        # in BytesIO and iterate over the arrays in the .npz archive
        for f, x in np.load(r).items():
            yield "{0}\t{1}".format(k, f), x

sc.binaryFiles(inputPath).flatMap(extract)
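
If you want a DataFrame rather than an RDD of NumPy arrays, one possible follow-up (a sketch, not part of the original answer) is to convert each array to a plain Python list before calling toDF, since Spark rows cannot hold ndarray objects directly. This assumes each array is 1-D and small enough to fit in a single row; the column names "name" and "values" are arbitrary:

# Sketch: build a DataFrame from the extracted (key, array) pairs.
# Assumes 1-D numeric arrays; .tolist() converts them to plain lists
# that Spark can infer as array<double> columns.
rdd = sc.binaryFiles(inputPath).flatMap(extract)
df = rdd.map(lambda kv: (kv[0], kv[1].tolist())).toDF(["name", "values"])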
