I have a large number of NumPy arrays stored in S3 as npz archives. What is the best way to load them into a PySpark RDD/DataFrame of NumPy arrays? I have tried loading the files with the sc.wholeTextFiles API:
rdd = sc.wholeTextFiles("s3://[bucket]/[folder_containing_npz_files]")
However, numpy.load expects a filename or a file-like object, not a string of the file's contents, and holding each archive's full contents in memory as a string takes up a lot of memory.
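Is something like the following sketch the right direction? It assumes sc.binaryFiles (which, as I understand it, yields each file's raw bytes rather than a decoded string) and wraps the payload in io.BytesIO so numpy.load gets the file-like object it needs; the bucket and folder names are placeholders:

import io
import numpy as np

# binaryFiles returns an RDD of (path, bytes) pairs, one per file
rdd = sc.binaryFiles("s3://[bucket]/[folder_containing_npz_files]")

def npz_to_arrays(record):
    path, payload = record
    # Wrap the raw bytes in an in-memory buffer so numpy.load can read it
    npz = np.load(io.BytesIO(payload))
    # Emit one (path, array_name, array) triple per array in the archive
    return [(path, name, npz[name]) for name in npz.files]

arrays = rdd.flatMap(npz_to_arrays)

Would this still pull each whole archive onto a single executor, or is there a better pattern for this?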