
This post is to get basic information/links to understand running Python code on either Lambda or EC2.

My code structure is pretty simple:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load more packages

input_data = pd.read_csv(...)

def do_stuff(input, parameters):
    action1
    action2
    output.to_csv(...)
    plt.savefig(...)

do_stuff(input_data,input_parameter)

I need to run this code on AWS, but I am not sure which to use: Lambda or EC2. Also, the input file is on my local PC, and the output gets saved to a specific local folder. Do I need to save them to S3? If so, what does the path look like? Do I still use `import os`?
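To give a feel for the S3 side of this, here is a minimal sketch of how a "path" works there and how a DataFrame gets serialized for upload. The bucket and key names are hypothetical, and the actual boto3 transfer is shown commented out since it needs credentials:

```python
from io import StringIO

import pandas as pd

def s3_uri(bucket, key):
    # An S3 "path" is just a bucket name plus an object key, e.g.
    # s3://my-research-bucket/inputs/data.csv -- no drive letters or
    # local-style folders involved.
    return 's3://{}/{}'.format(bucket, key)

def df_to_csv_bytes(df):
    # Serialize a DataFrame to CSV bytes, ready to hand to s3.put_object.
    buf = StringIO()
    df.to_csv(buf, index=False)
    return buf.getvalue().encode('utf-8')

# The actual transfer would use boto3 (bucket name is hypothetical):
# import boto3
# s3 = boto3.client('s3')
# s3.put_object(Bucket='my-research-bucket',
#               Key='outputs/result.csv',
#               Body=df_to_csv_bytes(output))
```

`pd.read_csv` can also read an `s3://...` URI directly if the `s3fs` package is installed, which removes the need for explicit download code in many cases.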

I'm sorry for this noob-like question. I need some starting guidance on what I should read to get started. The AWS documentation gets technical quickly, and from the "Hello World" example on Lambda I couldn't understand much. Due to the lockdown, I'm unable to use my office desktop, and my personal Mac cannot handle the load. The input and output files are pretty small: cumulatively less than 5 MB (there are multiple input files).

  • A couple of relevant questions: Where do the input files come from? How often do you have to run this function? Commented Jun 20, 2020 at 18:14
  • These input files are from a dataset that I've cleaned. I am essentially looking at AWS to run my code rather than running it on my personal PC. There is no schedule for running the files - I need to run it to create outputs which I further analyse through basic stats stuff. Commented Jun 20, 2020 at 18:17
  • How often does this need to run? What would trigger it to run? Commented Jun 20, 2020 at 18:38
  • I run it whenever I need output data with revised parameters. It's not triggered by any live event per se - as and when I make progress with my research work to identify new parameter values. Commented Jun 20, 2020 at 18:53

2 Answers


If this is something that needs to be done often, you could potentially create a workflow where you:

  1. Upload the input .csv file into an S3 bucket

  2. Your AWS Lambda function listens for changes in the S3 bucket and your code is triggered to run when a new file is uploaded.

  3. Your code saves the output .csv to a second S3 bucket.

The code might look very roughly like this (modified from this example):

import boto3
from urllib.parse import unquote_plus

s3_client = boto3.client('s3')

def handle_csv(original_csv_path, output_csv_path):
    # <process csv code>
    ...

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
        tmpkey = key.replace('/', '')
        # Lambda functions can only write to the local /tmp directory
        download_path = '/tmp/{}'.format(tmpkey)
        upload_path = '/tmp/processed-{}'.format(tmpkey)
        s3_client.download_file(bucket, key, download_path)
        handle_csv(download_path, upload_path)
        # Upload the result to the second (output) bucket
        s3_client.upload_file(upload_path, '<output-bucket>', key)

A "path" in S3 is a bucket name plus an object key; the key part might look like: 'input_csvs/1.csv'
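To make the dictionary lookups in the handler above concrete, here is a trimmed-down stand-in for the event payload an S3-triggered Lambda receives (the bucket and key names are made up, and the real event carries many more fields):

```python
from urllib.parse import unquote_plus

# A minimal stand-in for the event dict Lambda passes to the handler
# when a file lands in the watched bucket.
sample_event = {
    'Records': [
        {
            's3': {
                'bucket': {'name': 'my-input-bucket'},
                'object': {'key': 'input_csvs/1.csv'},
            }
        }
    ]
}

# The same lookups the handler performs:
record = sample_event['Records'][0]
bucket = record['s3']['bucket']['name']
# Keys arrive URL-encoded (spaces become '+'), hence unquote_plus.
key = unquote_plus(record['s3']['object']['key'])
```

This is why the handler loops over `event['Records']`: a single invocation can, in principle, carry more than one uploaded object.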


2 Comments

What does the path look like? For example, when running locally it's "\users\....\data.csv". But how does it work with files in S3? Also, can I simply import packages such as scipy and pandas with the import command itself in Lambda?
Unfortunately, Lambda does not support scipy or pandas out of the box. There are added steps needed for setting that environment up. For example: hackersandslackers.com/pandas-aws-lambda

This is the OP here, and though I don't have a good answer, I can summarize what I've figured out.

Lambda: I found a helpful YouTube video to understand how to get Lambda working. Also, to use Python packages such as numpy and pandas, you'll need to add a Lambda layer. I was able to do it by going through this Medium post. But I hadn't completely figured out how to connect my input CSV files and export my output CSV file. I stopped dead in my tracks when I realized Lambda has a maximum continuous runtime of 15 minutes. My Markov simulation code takes 24 hours, so Lambda was out of the question, and I didn't pursue it further. (P.S.: I read later that there are some "complicated" ways to make it work, but I wasn't even clear on how Lambda usage would be charged.)

EC2: There are a couple of resources that helped a lot for running my code on an EC2 AWS Linux server. A Medium post on running a Jupyter server was the most helpful, and then I switched to using Python and conda in the terminal itself through another helpful Medium post. Further, I'm using the Dropbox API and its Python package to push my output files to the cloud from the code run on EC2.
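That Dropbox push can be sketched roughly as follows, using the official `dropbox` package. The access token, folder name, and file names here are hypothetical, and the network call itself is shown commented out:

```python
import os

def dropbox_dest(local_path, folder='/research-outputs'):
    # Map a local output file to a destination path inside Dropbox.
    # Dropbox paths always start at '/' within the app's folder.
    return '{}/{}'.format(folder.rstrip('/'), os.path.basename(local_path))

# The actual upload (the token comes from the Dropbox developer console):
# import dropbox
# dbx = dropbox.Dropbox('YOUR_ACCESS_TOKEN')
# with open('output.csv', 'rb') as f:
#     dbx.files_upload(f.read(), dropbox_dest('output.csv'))
```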

TL;DR: Lambda won't work for me, and EC2 worked, largely thanks to a Medium post. Also, I need to understand how CLI code works to get a better grasp of how things work.

