
I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:

import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

In order to give complete access I have set the bucket policy on the S3 bucket as follows:

{
"Version": "2012-10-17",
"Id": "statement1",
"Statement": [
    {
        "Sid": "statement1",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": "arn:aws:s3:::example_bucket"
    }
  ]
}

Unfortunately I still get the following error in python:

boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

Wondering if someone could help explain how to either correctly set the permissions in AWS S3 or configure pandas correctly to import the file. Thanks!

  • Shouldn't there be a double slash after s3? Commented Jun 13, 2015 at 12:00
  • Yes, you're right, there should be. I also had to change the location of the bucket and file: tripData = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data.csv'), and I had to update the permissions on the individual file. But it works now, cheers. Commented Jun 13, 2015 at 23:05
  • Please add your solution as an answer to help other Stack Overflow users. Commented Jun 15, 2015 at 5:19
  • When using read_csv to read files from S3, does pandas first download the file to disk and then load it into memory, or does it stream from the network directly into memory? Commented Apr 5, 2016 at 21:46

8 Answers


Using pandas 0.20.3

import boto3
import pandas as pd
import sys

if sys.version_info[0] < 3: 
    from StringIO import StringIO # Python 2.x
else:
    from io import StringIO # Python 3.x

client = boto3.client('s3')

bucket_name = 'my_bucket'

object_key = 'my_file.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')

df = pd.read_csv(StringIO(csv_string))
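If your credentials aren't being picked up automatically (for example from ~/.aws/credentials or environment variables), you can also pass them to the client explicitly. A minimal sketch, assuming the keys are stored in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables:

import os
import boto3

# boto3 reads these environment variables on its own; passing them explicitly
# is shown here only to make the credential flow visible
client = boto3.client(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)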

4 Comments

When I import it this way the df's columns do not appear?
I'm trying this and I'm getting errors in the id and secret key calls to os.environ -- is that something I have to set up in terminal or something?
@ZachOakes Yes, that's something you would have needed to set up. Those two lines assume that your ID and SECRET were previously saved as environment variables, but you don't need to pull them from environment variables. Instead, you can replace those two lines with whatever method you like to get your ID and SECRET into your code.
Also works for DictReader: reader = csv.DictReader(io.StringIO(body), fieldnames=fieldnames)

Based on this answer that suggested using smart_open for reading from S3, this is how I used it with Pandas:

import os
import pandas as pd
from smart_open import smart_open

aws_key = os.environ['AWS_ACCESS_KEY']
aws_secret = os.environ['AWS_SECRET_ACCESS_KEY']

bucket_name = 'my_bucket'
object_key = 'my_file.csv'

path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

df = pd.read_csv(smart_open(path))
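Note that newer releases of smart_open deprecate the smart_open() function in favour of open(), and credentials are passed via transport_params rather than embedded in the URI. A rough sketch, assuming a recent smart_open version (check the docs for the one you have installed):

import boto3
import pandas as pd
from smart_open import open  # replaces smart_open() in newer releases

session = boto3.Session()  # picks up credentials from the environment or ~/.aws
df = pd.read_csv(open('s3://my_bucket/my_file.csv', transport_params={'client': session.client('s3')}))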

Comments


You don't need pandas; you can just use Python's built-in csv library:

import csv
import boto.s3
from boto.s3.key import Key

def read_file(bucket_name, region, remote_file_name, aws_access_key_id, aws_secret_access_key):
    # reads a csv from AWS S3 and returns the rows as a list of lists

    # first establish a connection with your credentials and region
    conn = boto.s3.connect_to_region(
        region,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key)

    # next obtain the key of the csv you want to read;
    # you will need the bucket name and the csv file name
    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket)
    key.key = remote_file_name
    data = key.get_contents_as_string()
    key.close()

    # the contents come back as a single string, so split it into lines;
    # the separator is usually '\r\n' -- if not, inspect the file and
    # split on whatever it actually uses
    reader = csv.reader(data.split('\r\n'))
    data = []
    header = next(reader)  # skip the header row
    for row in reader:
        data.append(row)

    return data
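A hypothetical call might look like the following (the bucket, region, file name and credential values are placeholders), and if you do want a DataFrame after all, the rows can be handed straight to pandas:

rows = read_file(
    bucket_name='my_bucket',
    region='ap-southeast-2',
    remote_file_name='data_1.csv',
    aws_access_key_id='MY_KEY_ID',
    aws_secret_access_key='MY_SECRET')

import pandas as pd
df = pd.DataFrame(rows)  # note: read_file() skips the header row, so no column names here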

hope it solved your problem, good luck! :)

Comments


Without pandas (it's a big dependency just to read a CSV file, folks):

import csv
import boto3
from io import StringIO
client = boto3.client("s3", region_name="eu-west-2")
data = client.get_object(Bucket=bucket, Key=_file)
reader = csv.DictReader(StringIO(data['Body'].read().decode('utf-8')))

1 Comment

How can I load only a fraction of the csv?
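Regarding the comment about loading only a fraction of the CSV: one simple option (a sketch, not the only way) is to stop iterating early, for example with itertools.islice; this still downloads the whole object, so for very large files you could instead pass a Range argument to get_object and accept that the final line may be cut off.

import csv
import boto3
from io import StringIO
from itertools import islice

client = boto3.client("s3", region_name="eu-west-2")
data = client.get_object(Bucket=bucket, Key=_file)
reader = csv.DictReader(StringIO(data['Body'].read().decode('utf-8')))
first_100_rows = list(islice(reader, 100))  # materialise only the first 100 rows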

I eventually realised that you also need to set the permissions on each individual object within the bucket in order to read it: the bucket policy above only names arn:aws:s3:::example_bucket, not the objects inside it (which a resource like arn:aws:s3:::example_bucket/* would cover). I made the object public with the following code:

from boto.s3.key import Key
k = Key(bucket)
k.key = 'data_1.csv'
k.set_canned_acl('public-read')

And I also had to modify the address of the bucket in the pd.read_csv command to the region's REST endpoint (the original URL pointed at the s3-website static-hosting endpoint, which appears to be what caused the 405 Method Not Allowed):

data = pd.read_csv('https://s3-ap-southeast-2.amazonaws.com/example_bucket/data_1.csv')

2 Comments

How do you modify the address so it becomes a URL that pandas can read?
You've made this file readable by anyone in the world, which most people should probably avoid doing. @jpobst's answer above, which provides the correct credentials to read the file, is what most folks should do.

You can use the AWS SDK for pandas (awswrangler), a library that extends pandas to work smoothly with AWS data stores such as S3.

import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/file.csv")
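If you need non-default credentials, awswrangler functions accept a boto3 session, so you don't have to rely on the environment. A small sketch (the profile name is just a placeholder):

import boto3
import awswrangler as wr

session = boto3.Session(profile_name="my_profile")  # hypothetical named profile
df = wr.s3.read_csv("s3://bucket/file.csv", boto3_session=session)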

Comments


Pandas (starting with version 1.2.0) supports reading and writing files stored in S3 via the s3fs Python package. S3Fs is a Pythonic file interface to S3; it builds on top of botocore.

pip install s3fs

Use an S3 URI.

To read a file:

import pandas as pd

df = pd.read_csv("s3://my-bucket-name/sample.csv")

To write a file:

import pandas as pd

df.to_csv("s3://my-bucket-name/sample.csv")
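If your credentials aren't available through the usual lookup (environment variables, ~/.aws/credentials, an instance role), pandas 1.2.0+ can forward them to s3fs via storage_options; the "key"/"secret" names below follow the fsspec/s3fs convention, and the values are placeholders:

import pandas as pd

df = pd.read_csv(
    "s3://my-bucket-name/sample.csv",
    storage_options={
        "key": "MY_ACCESS_KEY_ID",        # placeholder credentials
        "secret": "MY_SECRET_ACCESS_KEY",
    },
)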

Comments


You can try this:

import boto3
import pandas as pd

s3_client = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    endpoint_url=ENDPOINT_URL,
)

response = s3_client.get_object(Bucket=BUCKET_NAME, Key=OBJECT_KEY)
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

if status == 200:
    df = pd.read_csv(response.get("Body"))
    print('Successfully read dataframe from S3')
else:
    print(f"Unsuccessful S3 get_object. Status: {status}")

Comments
