
I have a small daily computing job that imports data from BigQuery, processes it with Python numerical computing libraries (pandas, numpy), and then writes the results to an external table (Firestore, or MySQL in another project).
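For concreteness, the job's shape is roughly the sketch below. All table names, project IDs, and connection strings are placeholders, and the pure-pandas transform is kept separate from the I/O so it can be tested locally:

```python
# Hypothetical sketch of the daily job; names and queries are placeholders,
# not taken from a real project.
import pandas as pd


def process(df: pd.DataFrame) -> pd.DataFrame:
    """Pure-pandas transform: easy to unit-test without any GCP access."""
    out = df.groupby("category", as_index=False)["amount"].sum()
    out["amount_usd"] = out["amount"].round(2)
    return out


def main():
    # Requires google-cloud-bigquery, sqlalchemy, and application-default
    # credentials; imported here only to outline the flow.
    from google.cloud import bigquery          # assumed dependency
    from sqlalchemy import create_engine       # assumed dependency

    client = bigquery.Client(project="my-analytics-project")  # placeholder
    df = client.query(
        "SELECT category, amount FROM dataset.daily_facts"    # placeholder
    ).to_dataframe()

    result = process(df)

    # Write to an external MySQL table in another project (placeholder DSN).
    engine = create_engine("mysql+pymysql://user:pass@host/db")
    result.to_sql("daily_summary", engine, if_exists="replace", index=False)


if __name__ == "__main__":
    main()
```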

What is the recommended way to deploy it on GCP?

Our devops team advises against creating a single VM just for a batch job. They would prefer not to manage VM infrastructure themselves, reasoning that there should be managed services built for batch jobs. They insist that I use Dataflow, but I think Dataflow's distributed nature is a bit of overkill.

Many thanks,


Updated October 14, 2019:

I'm thinking about dockerizing the batch job and deploying it to a Kubernetes (K8s) cluster. The downside is that the cluster would need to host several such jobs to be worth the setup and maintenance effort. Can someone advise on the feasibility and suitability of this approach?
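If the job were dockerized, the natural Kubernetes fit for a daily run would be a CronJob. A minimal sketch, assuming a placeholder image name (on Kubernetes from that era the API group was `batch/v1beta1`; on clusters ≥1.21 it is `batch/v1`):

```yaml
# Hypothetical CronJob manifest; image and name are placeholders.
apiVersion: batch/v1beta1        # batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: daily-import
spec:
  schedule: "0 2 * * *"          # run daily at 02:00
  concurrencyPolicy: Forbid      # don't let a slow run overlap the next one
  jobTemplate:
    spec:
      backoffLimit: 2            # cluster-level retry policy
      activeDeadlineSeconds: 10800   # kill runs longer than 3 h
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: gcr.io/my-project/daily-import:latest  # placeholder
              resources:
                requests:
                  memory: "4Gi"  # pandas/numpy jobs are memory-bound
```

This does nothing about the main downside raised above, though: the cluster itself still has to be provisioned, paid for, and maintained between runs.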


Updated October 15, 2019:

Thanks to Alex Titov for his comment at https://googlecloud-community.slack.com/archives/C0G6VB4UE/p1571032864020000. Based on his suggestion, I'm going to break my job into multiple small Cloud Functions and chain them together into a pipeline with Cloud Scheduler and/or Cloud Composer.
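One stage of such a pipeline could look like the sketch below: Cloud Scheduler publishes a Pub/Sub message, which triggers a background Cloud Function; each stage then publishes to the next stage's topic. The function and field names here are illustrative, not from an actual pipeline:

```python
# Hypothetical pipeline stage as a Python background Cloud Function
# triggered by Pub/Sub (names and payload shape are assumptions).
import base64
import json


def stage_extract(event, context):
    """Entry point: `event["data"]` is the base64-encoded Pub/Sub payload."""
    params = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # ... run the BigQuery extract for params["date"], write an intermediate
    # table, then publish a message to the next stage's topic ...
    return f"extracted {params['date']}"
```

One constraint worth checking before committing to this: Cloud Functions of that generation had a hard execution timeout (9 minutes max), so each stage must finish within that limit.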

  • What are your requirements? Constraints? How much memory do you need? How long does a run take? What's the retry policy? There are a lot of solutions, and with these details the right one can be recommended! Commented Oct 13, 2019 at 0:54
  • Thanks @guillaumeblaquiere. The batch job fits comfortably on a single VM with Python and its data-processing and machine-learning libraries installed. A run can take up to 1–2 hours. Retries can be handled in application logic, while a failure to spawn the VM should send an alert to the job owner. Commented Oct 14, 2019 at 3:20
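"Retries handled in application logic", as mentioned in the comment above, can be as small as a stdlib wrapper; the helper name and defaults below are illustrative:

```python
# Minimal application-level retry sketch (helper name is hypothetical).
import time


def with_retries(fn, attempts=3, delay_s=5.0):
    """Call fn(); on failure sleep and retry; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay_s)
```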

2 Answers


Cloud Dataflow does exactly what you are looking for, and it's much easier to manage, scale, and build on than a VM. Ask yourself a few questions beforehand, and if none of them rule it out, use Dataflow:

  • Do I want to be restricted to a specific cloud provider (GCP in this case)?
  • In this project, are other cloud services used, or does it just consume raw cloud infrastructure (keeping consistency)? Also, in what direction do we want the project to go (custom or managed cloud solutions)?
  • Do I want absolute control of this batch-processing tool? If so, you may not have it with Dataflow.
  • Other considerations, like cost, deployment time, and ramp-up time.

If all answers incline towards a managed cloud service, then use one.


1 Comment

Thanks @Horatiu Jeflea. Here we only use Google as our cloud provider. The direction is to easily deploy many small batch jobs like this one (our analysts find inputs of this size much more convenient to work with in Python pandas/numpy than in Apache Beam or even Spark). A run can take up to 1–2 hours; cost is definitely a concern, but right now we need options first.

If you containerize your job, there are two serverless solutions for running it. One day a third will be available, when Cloud Run can run for more than 15 minutes (on the roadmap, but without a release date).

  1. Use Cloud Build. Remember to set the timeout correctly. Cloud Build is actually designed to run any container. I wrote an article on this.

  2. Use AI Platform. A (great) Googler has published an article on this.

Both solutions are great, and you can choose the machine type of the underlying VM that runs your container. Thanks to this, you don't have to manage a K8s cluster or pay for it when it isn't used.
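For the Cloud Build option, "set the timeout correctly" matters because the default build timeout is 10 minutes, well short of a 1–2 hour run. A minimal sketch of the build config, with a placeholder image name:

```yaml
# Hypothetical cloudbuild.yaml: the "build" is just running the job's
# container as a single build step (image name is a placeholder).
timeout: 7200s        # default is 10 min; allow up to 2 h for the job
steps:
  - name: gcr.io/my-project/daily-import:latest
```

A scheduler (e.g. Cloud Scheduler) can then kick off this build daily via the Cloud Build API.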

1 Comment

Thanks @guillaumeblaquiere, our devops team says they're going to take a look at Cloud Build.
