
I have a small daily computing job that imports data from BigQuery, processes it with Python numerical computing libraries (pandas, numpy), and then writes the results to an external table (Firestore, or MySQL in another project).
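For concreteness, the job's shape is roughly the sketch below. All table names, project IDs, and connection strings are placeholders, and the pure-pandas transform is kept separate from the I/O so it can be tested locally:

```python
# Hypothetical sketch of the daily job; names and queries are placeholders,
# not taken from a real project.
import pandas as pd


def process(df: pd.DataFrame) -> pd.DataFrame:
    """Pure-pandas transform: easy to unit-test without any GCP access."""
    out = df.groupby("category", as_index=False)["amount"].sum()
    out["amount_usd"] = out["amount"].round(2)
    return out


def main():
    # Requires google-cloud-bigquery, sqlalchemy, and application-default
    # credentials; imported here only to outline the flow.
    from google.cloud import bigquery          # assumed dependency
    from sqlalchemy import create_engine       # assumed dependency

    client = bigquery.Client(project="my-analytics-project")  # placeholder
    df = client.query(
        "SELECT category, amount FROM dataset.daily_facts"    # placeholder
    ).to_dataframe()

    result = process(df)

    # Write to an external MySQL table in another project (placeholder DSN).
    engine = create_engine("mysql+pymysql://user:pass@host/db")
    result.to_sql("daily_summary", engine, if_exists="replace", index=False)


if __name__ == "__main__":
    main()
```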

What is the recommended way to deploy it on GCP?

Our devops team advises against creating a single VM just for a batch job. They would prefer not to manage VM infrastructure themselves, reasoning that there should be managed services built for batch jobs. They insist that I use Dataflow, but I think Dataflow's distributed nature is a bit of overkill.

Many thanks,


Updated October 14, 2019:

I'm thinking about dockerizing the batch job and deploying it to a Kubernetes (K8s) cluster. The downside is that the cluster would need to host several such jobs to be worth the setup and maintenance effort. Can someone advise on the feasibility and suitability of this approach?
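If the job were dockerized, the natural Kubernetes fit for a daily run would be a CronJob. A minimal sketch, assuming a placeholder image name (on Kubernetes from that era the API group was `batch/v1beta1`; on clusters ≥1.21 it is `batch/v1`):

```yaml
# Hypothetical CronJob manifest; image and name are placeholders.
apiVersion: batch/v1beta1        # batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: daily-import
spec:
  schedule: "0 2 * * *"          # run daily at 02:00
  concurrencyPolicy: Forbid      # don't let a slow run overlap the next one
  jobTemplate:
    spec:
      backoffLimit: 2            # cluster-level retry policy
      activeDeadlineSeconds: 10800   # kill runs longer than 3 h
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: gcr.io/my-project/daily-import:latest  # placeholder
              resources:
                requests:
                  memory: "4Gi"  # pandas/numpy jobs are memory-bound
```

This does nothing about the main downside raised above, though: the cluster itself still has to be provisioned, paid for, and maintained between runs.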


Updated October 15, 2019:

Thanks to Alex Titov for his comment at https://googlecloud-community.slack.com/archives/C0G6VB4UE/p1571032864020000. Based on his suggestion, I'm going to break my job into multiple small Cloud Functions and chain them together into a pipeline with Cloud Scheduler and/or Cloud Composer.
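One stage of such a pipeline could look like the sketch below: Cloud Scheduler publishes a Pub/Sub message, which triggers a background Cloud Function; each stage then publishes to the next stage's topic. The function and field names here are illustrative, not from an actual pipeline:

```python
# Hypothetical pipeline stage as a Python background Cloud Function
# triggered by Pub/Sub (names and payload shape are assumptions).
import base64
import json


def stage_extract(event, context):
    """Entry point: `event["data"]` is the base64-encoded Pub/Sub payload."""
    params = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # ... run the BigQuery extract for params["date"], write an intermediate
    # table, then publish a message to the next stage's topic ...
    return f"extracted {params['date']}"
```

One constraint worth checking before committing to this: Cloud Functions of that generation had a hard execution timeout (9 minutes max), so each stage must finish within that limit.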

  • What are your requirements? Constraints? How much memory do you need? How long does a run take? What's the retry policy? There are a lot of solutions, and with these details the right one can be recommended! Commented Oct 13, 2019 at 0:54
  • Thanks @guillaumeblaquiere. The batch job fits comfortably on a single VM with Python and its data-processing and machine-learning libraries installed. A run can take up to 1–2 hours. Retries can be handled in application logic, while a failure to spawn the VM should send an alert to the job owner. Commented Oct 14, 2019 at 3:20
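"Retries handled in application logic", as mentioned in the comment above, can be as small as a stdlib wrapper; the helper name and defaults below are illustrative:

```python
# Minimal application-level retry sketch (helper name is hypothetical).
import time


def with_retries(fn, attempts=3, delay_s=5.0):
    """Call fn(); on failure sleep and retry; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay_s)
```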

2 Answers


Cloud Dataflow does exactly what you are looking for, and it's much easier to manage, scale, and build on than a VM. Ask yourself a few questions beforehand, and if none of them rule it out, use Dataflow:

  • Do I want to be restricted to a specific cloud provider (GCP in this case)?
  • In this project, are other cloud services used, or does it just consume raw cloud infrastructure (keeping consistency)? Also, in what direction do we want the project to go (custom or managed cloud solutions)?
  • Do I want absolute control of this batch-processing tool? If so, you may not have it with Dataflow.
  • Other considerations, like cost, deployment time, and ramp-up time.

If all answers incline towards a managed cloud service, then use one.


1 Comment

Thanks @Horatiu Jeflea. Here we only use Google as our cloud provider. The direction is to easily deploy many small batch jobs like this one (our analysts find inputs of this size much more convenient to work with in Python pandas/numpy than in Apache Beam or even Spark). A run can take up to 1–2 hours; cost is definitely a concern, but right now we need options first.

If you containerize your job, there are two serverless solutions for running it. One day a third will be available, when Cloud Run can run for more than 15 minutes (on the roadmap, but without a release date).

  1. Use Cloud Build. Remember to set the timeout correctly. Cloud Build is actually designed to run any container. I wrote an article on this.

  2. Use AI Platform. A (great) Googler has published an article on this.

Both solutions are great, and you can choose the machine type of the underlying VM that runs your container. Thanks to this, you don't have to manage a K8s cluster or pay for it when it isn't used.
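For the Cloud Build option, "set the timeout correctly" matters because the default build timeout is 10 minutes, well short of a 1–2 hour run. A minimal sketch of the build config, with a placeholder image name:

```yaml
# Hypothetical cloudbuild.yaml: the "build" is just running the job's
# container as a single build step (image name is a placeholder).
timeout: 7200s        # default is 10 min; allow up to 2 h for the job
steps:
  - name: gcr.io/my-project/daily-import:latest
```

A scheduler (e.g. Cloud Scheduler) can then kick off this build daily via the Cloud Build API.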

1 Comment

Thanks @guillaumeblaquiere, our devops team says they're going to take a look at Cloud Build.
