
I'm a noob to AWS and lambda, so I apologize if this is a dumb question. What I would like to be able to do is load a spreadsheet into an s3 bucket, trigger lambda based on that upload, have lambda load the csv into pandas and do stuff with it, then write the dataframe back to a csv into a second s3 bucket.

I've read a lot about zipping a python script and all the libraries and dependencies and uploading that, and that's a separate question. I've also figured out how to trigger lambda upon uploading a file to an S3 bucket and to automatically copy that file to a second s3 bucket.

The part I'm having trouble finding any information on is that middle part: loading the file into pandas and manipulating it, all inside the lambda function.

First question: is something like that even possible? Second question: how do I "grab" the file from the S3 bucket and load it into pandas? Would it be something like this?

import pandas as pd
import boto3
import json
s3 = boto3.resource('s3')

def handler(event, context):
    dest_bucket = s3.Bucket('my-destination-bucket')
    df = pd.read_csv(event['Records'][0]['s3']['object']['key'])
    # stuff to do with dataframe goes here

    s3.Object(dest_bucket.name, <code for file key>).copy_from(CopySource = df)

I really have no idea if that's even close to right; it's a complete shot in the dark. Any and all help would be really appreciated, because I'm pretty obviously out of my element!

4 Comments
  • It should be possible; see the following question for how to read the file from S3 into pandas: stackoverflow.com/questions/37703634/… Commented Jan 16, 2018 at 22:44
  • Thanks for your response. It seems that answer is more about accessing a file in an S3 bucket from a normal Python script; Lambda isn't used at all. How would I make modifications to do that within an AWS Lambda function, as per my question? Commented Jan 16, 2018 at 22:55
  • You can use the Python script within your handler method or write a separate method. It explains the steps to do it; in your case you need to put that inside the Lambda function. Since you have already configured the Lambda trigger, it should work. Commented Jan 16, 2018 at 22:59
  • It looks like you are passing the S3 object key to the pandas read_csv() method. An S3 key is of the form dir1/dir2/file.csv. What you need is the S3 URI for the object, and that's of the form s3://bucket/dir1/dir2/file.csv. So, construct the proper URI from the bucket and key in the event object and then pass it to pandas read_csv() (see the sketch below). Commented Jan 17, 2018 at 0:49
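
A minimal sketch of what that last comment describes, assuming pandas is bundled into the deployment along with s3fs (the package pandas relies on to read s3:// URIs); names here are illustrative:

from urllib.parse import unquote_plus

import pandas as pd  # reading s3:// URIs also requires s3fs in the package

def handler(event, context):
    record = event['Records'][0]['s3']
    # keys arrive URL-encoded in the event, so decode before building the URI
    key = unquote_plus(record['object']['key'])
    uri = 's3://{}/{}'.format(record['bucket']['name'], key)
    df = pd.read_csv(uri)
    # stuff to do with dataframe goes here
    return len(df)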

1 Answer


This code runs when a PUT triggers the Lambda function; the function then GETs the object and PUTs it to another location:

from __future__ import print_function
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # bucket and key of the object that triggered the event; the key arrives
    # URL-encoded in the event, so decode it before using it
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    end_path = key  # destination key; set this to wherever the copy should land
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        s3_upload_article(response['Body'].read(), bucket, end_path)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

def s3_upload_article(html, bucket, end_path):
    # PUT the fetched body back to S3 under the destination key
    s3.put_object(Body=html, Bucket=bucket, Key=end_path, ContentType='text/html', ACL='public-read')

I broke this code out from a more complicated Lambda script I have written; however, I hope it displays some of what you need to do. The PUT of the object only triggers the script. Any other actions that occur after the event is triggered are up to you to code into the script.

bucket = event['Records'][0]['s3']['bucket']['name']
key = unquote_plus(event['Records'][0]['s3']['object']['key'])

Bucket and key in the first few lines are the bucket and key of the object that triggered the event. Everything else is up to you.
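
If you would rather not build an s3:// URI, note that the get_object response's Body is a file-like stream, so pandas can read it directly. A minimal sketch along those lines, again assuming pandas is available in the function's package:

from urllib.parse import unquote_plus

import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    response = s3.get_object(Bucket=bucket, Key=key)
    # response['Body'] is a file-like streaming object, so pandas can
    # consume it directly without writing a temp file
    df = pd.read_csv(response['Body'])
    # stuff to do with dataframe goes here
    return len(df)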


7 Comments

Thanks for your response. I'm still having trouble. I can successfully load my csv into pandas and manipulate it, but I'm really struggling with how to then take my dataframe, turn it back into a csv, and then put that file into a new bucket. Could you lend any clarity on how I can accomplish that?
@Tkelly I have never used pandas before, but it appears there is a pandas.DataFrame.to_csv function that may accomplish this: pandas.pydata.org/pandas-docs/stable/generated/… Are you having trouble with that step or the PUT?
@Tkelly Did you check this post? stackoverflow.com/questions/38154040/… It explains the steps to write the dataframe to an S3 bucket directly.
@UsmanAzhar I hadn't found that question before, thanks for pointing it out. I think that might be exactly what I need.
@NicholasMartinez the to_csv() method doesn't work on its own, as S3 wants the data in a different format. I think @UsmanAzhar pointed me in the right direction. I'm gonna give it a try and find out! Thanks again to both of you for the assistance.
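
To round out the write-back step the comments discuss: serialize the dataframe into an in-memory buffer with to_csv(), then pass the buffer's contents to S3 as the object body. A short sketch of that pattern (write_df_to_s3 is a hypothetical helper, not part of boto3):

from io import StringIO

import boto3
import pandas as pd

def write_df_to_s3(df, bucket, key):
    # to_csv() writes into the in-memory buffer instead of a local file
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    # put_object accepts the buffer contents as the object body
    boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=csv_buffer.getvalue())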
