
Is there any way to load/read an external file (e.g., from AWS S3) in NumPy? I have several .npy files stored in S3. I have tried to access them through an S3 presigned URL, but it seems neither numpy.load nor np.genfromtxt is able to read them.

I would rather not save the files to the local file system first and then load them into NumPy.

Any idea?

  • Of course you need some extra layer doing all the web-protocol work! NumPy's IO is probably designed for file-based IO only. In Python 3, you could try import requests; from io import BytesIO; response = requests.get(url); np.load(BytesIO(response.content)) (see the fleshed-out sketch after these comments). Commented Nov 15, 2016 at 11:26
  • Of course my snippet assumes the S3 link is public and needs no authentication. I don't know if that's the case. If not, you would need some library that handles the auth for accessing the files! Commented Nov 15, 2016 at 11:33
  • Are you able to read the files using requests? Commented Nov 15, 2016 at 13:35
  • Hi, I wasn't able to read them. In the end I'm using Spark textFiles, which makes it possible. Thanks!! Commented Nov 30, 2016 at 9:17
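
A fleshed-out version of the snippet from the first comment, as a minimal sketch: it assumes the presigned URL (the placeholder variable presigned_url below) is still valid and points at a single .npy file.

import requests
from io import BytesIO
import numpy as np

# Fetch the object over HTTP; a presigned URL carries its auth in the query string
response = requests.get(presigned_url)  # presigned_url is a placeholder
response.raise_for_status()

# np.load accepts any file-like object, so wrap the downloaded bytes in BytesIO
arr = np.load(BytesIO(response.content))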

3 Answers


Using s3fs

import numpy as np
from s3fs.core import S3FileSystem

s3 = S3FileSystem()  # uses your default AWS credentials

key = 'your_file.npy'
bucket = 'your_bucket'

# s3.open returns a file-like object, which np.load can read directly
arr = np.load(s3.open('{}/{}'.format(bucket, key)))

You might have to pass allow_pickle=True to np.load, depending on the file being read.
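
For example, if the file stores pickled Python objects (e.g. an object array), a minimal variation of the call above would be:

arr = np.load(s3.open('{}/{}'.format(bucket, key)), allow_pickle=True)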


1 Comment

Does this approach only copy across from S3 the bytes that are actually needed? When I np.load a multi-value .npz file locally, it seems to just load an index into memory, and only when I access a value does it take a while to load that value (e.g., f = np.load(filename); f['myvalue']). Does this do the same? (I tried to test it myself, but got very variable measurements, possibly because my EC2 instance is busy, and it will be for another week or so.) Although from the next answer it looks like s3fs is not suitable if performance matters.

I've compared s3fs and io.BytesIO for loading a 28 GB .npz file from S3: s3fs takes 30 minutes, while io.BytesIO takes 12 minutes.

import io
import numpy as np
from s3fs.core import S3FileSystem

# boto3 + io.BytesIO (s3_session is a boto3.session.Session created elsewhere)
obj = s3_session.resource("s3").Object(bucket, key)
with io.BytesIO(obj.get()["Body"].read()) as f:
    f.seek(0)  # rewind the buffer
    X, y = np.load(f).values()

# s3fs: stream directly from S3
fs = S3FileSystem()
with fs.open(f"s3://{bucket}/{key}") as s3file:
    X, y = np.load(s3file).values()

1 Comment

s3fs is really slow even for a few .npy files

I had success using boto and StringIO. Connect to S3 using boto and get your bucket, then read the file into NumPy with the following code:

import numpy as np
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO instead

key = bucket.get_key('YOUR_KEY')  # 'bucket' is a boto Bucket from your S3 connection
data_string = StringIO(key.get_contents_as_string())
data = np.load(data_string)

I am not sure it's the most efficient way, but it doesn't require a public URL.
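
On Python 3, where the StringIO module no longer exists, a rough equivalent (a sketch assuming boto3 and an in-memory io.BytesIO buffer; the bucket and key names are placeholders) would be:

import io
import boto3
import numpy as np

s3 = boto3.resource('s3')
obj = s3.Object('your_bucket', 'your_file.npy')  # placeholder bucket/key

# Download the object into an in-memory buffer and hand it to np.load
with io.BytesIO(obj.get()['Body'].read()) as buf:
    data = np.load(buf)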

Cheers, Michael

