
Is there any way to load/read an external file (e.g., from AWS S3) in NumPy? I have several .npy files stored in S3. I have tried to access them through an S3 presigned URL, but it seems neither numpy.load nor np.genfromtxt is able to read them.

I would rather not save the files to the local file system first and then load them into NumPy.

Any idea?

  • Of course you need some extra layer doing all the web-protocol work! NumPy's IO is probably designed for file-based IO only. In Python 3, you could try import requests; from io import BytesIO; response = requests.get(url); np.load(BytesIO(response.content)) (see the fleshed-out sketch after these comments). Commented Nov 15, 2016 at 11:26
  • Of course my snippet assumes the S3 link is public and needs no authentication. I don't know if that's the case. If not, you would need some library that handles the auth for accessing the files! Commented Nov 15, 2016 at 11:33
  • Are you able to read the files using requests? Commented Nov 15, 2016 at 13:35
  • Hi, I wasn't able to read them. In the end I'm using Spark textFiles, which makes it possible. Thanks!! Commented Nov 30, 2016 at 9:17
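
A fleshed-out version of the snippet from the first comment, as a minimal sketch: it assumes the presigned URL (the placeholder variable presigned_url below) is still valid and points at a single .npy file.

import requests
from io import BytesIO
import numpy as np

# Fetch the object over HTTP; a presigned URL carries its auth in the query string
response = requests.get(presigned_url)  # presigned_url is a placeholder
response.raise_for_status()

# np.load accepts any file-like object, so wrap the downloaded bytes in BytesIO
arr = np.load(BytesIO(response.content))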

3 Answers


Using s3fs

import numpy as np
from s3fs.core import S3FileSystem

s3 = S3FileSystem()  # uses your default AWS credentials

key = 'your_file.npy'
bucket = 'your_bucket'

# s3.open returns a file-like object, which np.load can read directly
arr = np.load(s3.open('{}/{}'.format(bucket, key)))

You might have to pass allow_pickle=True to np.load, depending on the file being read.
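
For example, if the file stores pickled Python objects (e.g. an object array), a minimal variation of the call above would be:

arr = np.load(s3.open('{}/{}'.format(bucket, key)), allow_pickle=True)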


1 Comment

Does this approach only copy across from S3 the bytes that are actually needed? When I np.load a multi-value .npz file locally, it seems to just load an index into memory, and only when I access a value does it take a while to load that value (e.g., f = np.load(filename); f['myvalue']). Does this do the same? (I tried to test it myself, but got very variable measurements, possibly because my EC2 instance is busy, and it will be for another week or so.) Although from the next answer it looks like s3fs is not suitable if performance matters.

I've compared s3fs and io.BytesIO for loading a 28 GB .npz file from S3: s3fs takes 30 minutes, while io.BytesIO takes 12 minutes.

import io
import numpy as np
from s3fs.core import S3FileSystem

# boto3 + io.BytesIO (s3_session is a boto3.session.Session created elsewhere)
obj = s3_session.resource("s3").Object(bucket, key)
with io.BytesIO(obj.get()["Body"].read()) as f:
    f.seek(0)  # rewind the buffer
    X, y = np.load(f).values()

# s3fs: stream directly from S3
fs = S3FileSystem()
with fs.open(f"s3://{bucket}/{key}") as s3file:
    X, y = np.load(s3file).values()

1 Comment

s3fs is really slow even for a few .npy files

I had success using boto and StringIO. Connect to S3 using boto and get your bucket, then read the file into NumPy with the following code:

import numpy as np
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO instead

key = bucket.get_key('YOUR_KEY')  # 'bucket' is a boto Bucket from your S3 connection
data_string = StringIO(key.get_contents_as_string())
data = np.load(data_string)

I am not sure it's the most efficient way, but it doesn't require a public URL.
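
On Python 3, where the StringIO module no longer exists, a rough equivalent (a sketch assuming boto3 and an in-memory io.BytesIO buffer; the bucket and key names are placeholders) would be:

import io
import boto3
import numpy as np

s3 = boto3.resource('s3')
obj = s3.Object('your_bucket', 'your_file.npy')  # placeholder bucket/key

# Download the object into an in-memory buffer and hand it to np.load
with io.BytesIO(obj.get()['Body'].read()) as buf:
    data = np.load(buf)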

Cheers, Michael

