
I am trying to create a Lambda function that automatically cleans CSV files from an S3 bucket. The bucket receives files every 5 minutes, so I have set up an S3 trigger for the Lambda function. To clean the CSV files I use the pandas library to create a DataFrame, and I have already installed a pandas layer. However, creating the DataFrame fails with an error. This is my code:

import json
import boto3
import pandas as pd
from io import StringIO


# create the S3 client
client = boto3.client('s3')

def lambda_handler(event, context):
    
    #define bucket_name and object_name
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_name = event['Records'][0]['s3']['object']['key']
    
    #create a df from the object
    df = pd.read_csv(object_name)
    

This is the error message:

[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'object_name'

On CloudWatch it additionally says:

OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k

Has anyone experienced the same issues? Thanks in advance for all your help!

  • read_csv("object_name") - I hope you noticed that "object_name" is a string here and not the actual variable declared 2 lines above. Commented Jun 4, 2022 at 12:30
  • Thanks for your answer! I have changed it to read_csv(object_name), and I get the following error message: "errorMessage": "[Errno 2] No such file or directory: 'test%2Fkey'", Commented Jun 4, 2022 at 12:37
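The test%2Fkey in that follow-up error is the key exactly as S3 delivers it in event notifications: object keys are URL-encoded, so a / in the key arrives as %2F (and spaces as +). A quick stdlib check of the decoding, using just the key from the error message above:

```python
from urllib.parse import unquote_plus

# S3 event notifications URL-encode object keys:
# '/' becomes '%2F' and ' ' becomes '+'.
encoded_key = 'test%2Fkey'            # key as it appears in the event
decoded_key = unquote_plus(encoded_key)
print(decoded_key)                    # test/key
```

So even once the key is read from the event correctly, it still has to be decoded before it can be used with the S3 API.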

3 Answers


You have to use the S3 client to download the file from S3 before handing it to pandas. Something like:

response = client.get_object(Bucket=bucket_name, Key=object_name)
df = pd.read_csv(response["Body"])

You'll also have to make sure the Lambda's execution role has the right permissions to read from the S3 bucket.
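Put together with the key decoding discussed in the comments, a complete handler might look like the sketch below. The parse_s3_event helper is a name I've introduced here to keep the event parsing separate; the actual cleaning step is left out:

```python
from urllib.parse import unquote_plus

def parse_s3_event(event):
    # S3 put-notification events carry the bucket name and the
    # URL-encoded object key; decode the key before using it.
    record = event['Records'][0]['s3']
    bucket_name = record['bucket']['name']
    object_name = unquote_plus(record['object']['key'])
    return bucket_name, object_name

def lambda_handler(event, context):
    import boto3         # provided by the Lambda Python runtime
    import pandas as pd  # provided by the pandas layer

    client = boto3.client('s3')
    bucket_name, object_name = parse_s3_event(event)

    # Download the object and feed its body (a file-like stream) to pandas
    response = client.get_object(Bucket=bucket_name, Key=object_name)
    df = pd.read_csv(response['Body'])

    # ... clean df and write the result back to S3 ...
    return {'rows': len(df)}
```

(The boto3 and pandas imports are kept inside the handler here only so the event-parsing helper stands on its own; in the real function they would normally sit at the top of the module, as in the question.)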



Change this line:

df = pd.read_csv("object_name")

to this:

df = pd.read_csv(object_name)

Comments

  • Thank you! I just updated it, but I still get an error message.
  • I changed it in the post, thanks for your insight!

Cause of error

object_name is just the object's key, i.e. a path relative to the bucket, and it has no significance without bucket_name. When you pass it to pd.read_csv, pandas looks for a local file with that name and raises FileNotFoundError.

Solution for the error

To refer to the S3 object properly, construct the fully qualified s3:// path from bucket_name and object_name. Also notice that the object key arrives URL-encoded (hence the %2F in your error), so unquote it before building the path. Note that pandas resolves s3:// URLs through the optional s3fs package, which must also be available to the Lambda function (e.g. via a layer).

import pandas as pd
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    
    #define bucket_name and object_name
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_name = event['Records'][0]['s3']['object']['key']
    
    #create a df from the fully qualified s3 path
    filepath = f's3://{bucket_name}/{unquote_plus(object_name)}'
    df = pd.read_csv(filepath)

