
I am trying to build an automatic process that transfers aggregated data from BigQuery to Redshift. After a lot of reading, I found that the best way to do it is like this:

BigQuery -> export to Google Cloud Storage -> use gsutil to transfer into S3 -> COPY from the CSV into a table on Redshift.

I made it into a Python script and it all seems to work fine when run from my PC.
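Roughly, the script does something like this (the project, dataset, bucket and connection names below are placeholders, and the IAM role ARN is made up for illustration):

import subprocess

from google.cloud import bigquery
import psycopg2

client = bigquery.Client()

# 1. Export the aggregated BigQuery table to Cloud Storage as CSV.
extract_job = client.extract_table(
    "my-project.my_dataset.aggregated_table",
    "gs://my-gcs-bucket/aggregated.csv",
)
extract_job.result()  # wait for the export to finish

# 2. Push the CSV from GCS to S3 with gsutil (AWS credentials live in ~/.boto).
subprocess.check_call(
    "gsutil cp gs://my-gcs-bucket/aggregated.csv s3://my-s3-bucket/",
    shell=True,
)

# 3. COPY the CSV from S3 into Redshift.
conn = psycopg2.connect(host="redshift-host", port=5439, dbname="mydb",
                        user="myuser", password="mypassword")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY aggregated_table
        FROM 's3://my-s3-bucket/aggregated.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV;
    """)
conn.close()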

But while testing, I found that running gsutil directly from the Cloud Shell makes the file transfer much faster. It seems Amazon and Google have some dedicated data pipeline between them or something.

I am trying to move the transfer process to a Google Cloud Function that I will trigger with a request (in Python, but the language ultimately doesn't matter as long as it works). I tried both the subprocess and os modules, and neither worked. In general, shell commands don't seem to execute from the Python function.

This is the code of the Cloud Function; it works perfectly when run manually from the Cloud Shell:

import subprocess

def hello_world(request):
    data = subprocess.call('gsutil -m cp gs://bucket/file.csv s3://bucket/',shell=True)
    print(data)
    return 'Success!'

After transferring the file, I will set up an S3 trigger for a Lambda function that inserts the data into a Redshift table.
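What I have in mind for that Lambda is roughly this (connection details, table name and IAM role are placeholders, and psycopg2 would have to be bundled with the function or provided as a layer):

import psycopg2

# Placeholder connection string for the Redshift cluster.
REDSHIFT_DSN = "host=redshift-host port=5439 dbname=mydb user=myuser password=mypassword"


def handler(event, context):
    # The S3 trigger passes the bucket and key of the file that just arrived.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    conn = psycopg2.connect(REDSHIFT_DSN)
    with conn, conn.cursor() as cur:
        cur.execute(f"""
            COPY aggregated_table
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            CSV;
        """)
    conn.close()
    return "ok"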

I read this: How to Transfer data from Google cloud storage to S3 without any manual activity?

But the scheduler didn't make much sense to me; maybe I am doing something wrong with it. I seem to be able to make requests, but that won't solve the problem of the shell command not being executed.

If there are better alternatives, I am open to them. Is it worth looking into doing it the other way around, with a Lambda and gsutil on the AWS side?

Comments:
  • What does "it doesn't work" mean? Did it throw an exception, did it return an error, did you capture the stdout/stderr output from the subprocess, and did it tell you something useful? Aside from that, take a look at boto3 - you might be able to connect it to GCS as a data source and S3 as a data sink. Commented Aug 16, 2019 at 17:06
  • The function executed and nothing happened. The return value in the data field was 0: "Function execution started ... Function execution took 180 ms, finished with status code: 200". Commented Aug 19, 2019 at 7:03

2 Answers


I understand you want to build an automatic process that transfers aggregated data from BigQuery to Redshift.

Unfortunately, gsutil is not available in the Cloud Functions environment.

Cloud Functions imposes a timeout [1], with a maximum of 9 minutes. So even if gsutil were available in the Cloud Functions environment, copying large files could exceed the timeout.

If knowing which file changed is crucial to your case, an alternative would be to set up a Cloud Pub/Sub topic [2] and register object change notifications to it [3].

With an App Engine service or a Compute Engine VM subscribing to this Pub/Sub topic, it will receive an event whenever a file changes, and it can then synchronize the change to Amazon S3, either with Amazon's API [4] (in the App Engine case) or with gsutil rsync (in the Compute Engine case).
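As a rough sketch (not a production implementation), such a subscriber could look like the following, assuming the google-cloud-pubsub, google-cloud-storage and boto3 libraries and placeholder project, subscription and bucket names. It buffers each object in memory, so it only suits modest file sizes:

import boto3
from google.cloud import pubsub_v1, storage

# Placeholders - use your own project, subscription and bucket names.
PROJECT = "my-gcp-project"
SUBSCRIPTION = "gcs-object-changes"
S3_BUCKET = "my-s3-bucket"

gcs = storage.Client()
s3 = boto3.client("s3")


def sync_object(message):
    # GCS Pub/Sub notifications carry the bucket and object name as attributes.
    attrs = message.attributes
    if attrs.get("eventType") == "OBJECT_FINALIZE":
        bucket = attrs["bucketId"]
        name = attrs["objectId"]
        data = gcs.bucket(bucket).blob(name).download_as_bytes()
        s3.put_object(Bucket=S3_BUCKET, Key=name, Body=data)
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
subscriber.subscribe(path, callback=sync_object).result()  # block and process events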

If knowing which file changed is not crucial to your use case, setting up a cron job on a Compute Engine VM to run gsutil rsync can achieve the same goal.
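For example, a crontab entry like this (bucket names and log path are placeholders; gsutil needs AWS credentials in the VM's ~/.boto file) would synchronize the bucket to S3 every hour:

0 * * * * /usr/bin/gsutil -m rsync -r gs://my-gcs-bucket s3://my-s3-bucket >> /var/log/gcs-s3-sync.log 2>&1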

Please note that outgoing network bandwidth from Cloud Storage incurs costs [5]. You can set up a budget alert [6] to avoid surprise charges when synchronizing large files.

[1] https://cloud.google.com/functions/docs/concepts/exec#timeout

[2] https://cloud.google.com/storage/docs/pubsub-notifications

[3] https://cloud.google.com/storage/docs/reporting-changes

[4] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

[5] https://cloud.google.com/storage/pricing#network-egress

[6] https://cloud.google.com/billing/docs/how-to/budgets


1 Comment

I am marking this as the answer because it's mostly what I did in the end. I created a VM with Jenkins (on the AWS side), installed gsutil, and got a service account JSON at the organization level so it has access to all projects. From then on, the Jenkins job simply exports a table as CSV to GCS, runs gsutil on the VM, and puts the file on S3, where a Lambda trigger is waiting.

Google Cloud Functions is sandboxed and you can't execute shell commands. Moreover, you have no guarantee about whether gsutil is installed, or which version... It's the "beauty" of serverless!

However, there are two alternatives:

  1. Use the Google Python client libraries to perform your API calls (a rough sketch is shown after this list). If something isn't implemented, use the Discovery API. It seems complex but could work... Also have a look at the second solution.
  2. Have a look at Cloud Run. Package your container however you want, with the gcloud SDK installed, and you can run your process as is. I wrote an article on this; you can find in it the basics of turning a function into a container.
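For the first alternative, a minimal sketch of an HTTP-triggered Cloud Function that copies an object from GCS to S3 with the client libraries instead of shelling out to gsutil could look like this (bucket and object names are placeholders, AWS credentials have to be provided to boto3, for example through environment variables, and the whole object is held in memory):

import boto3
from google.cloud import storage

# Placeholder bucket names - adjust for your own setup.
GCS_BUCKET = "my-gcs-bucket"
S3_BUCKET = "my-s3-bucket"


def copy_to_s3(request):
    # Name of the object to copy, passed as ?object=... in the request.
    object_name = request.args.get("object", "file.csv")

    blob = storage.Client().bucket(GCS_BUCKET).blob(object_name)
    data = blob.download_as_bytes()

    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=object_name, Body=data)
    return "Copied {} to s3://{}/{}".format(object_name, S3_BUCKET, object_name)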

