
I am trying to build an automatic process that transfers aggregated data from BigQuery to Redshift. After a lot of reading, I found that the best way to do it is like this:

BigQuery -> export to Google Cloud Storage -> use gsutil to transfer into S3 -> COPY from the CSV into a table on Redshift.

I made it into a Python script and it all seems to work fine when run from my PC.
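Roughly, the script does something like this (the project, dataset, bucket and connection names below are placeholders, and the IAM role ARN is made up for illustration):

import subprocess

from google.cloud import bigquery
import psycopg2

client = bigquery.Client()

# 1. Export the aggregated BigQuery table to Cloud Storage as CSV.
extract_job = client.extract_table(
    "my-project.my_dataset.aggregated_table",
    "gs://my-gcs-bucket/aggregated.csv",
)
extract_job.result()  # wait for the export to finish

# 2. Push the CSV from GCS to S3 with gsutil (AWS credentials live in ~/.boto).
subprocess.check_call(
    "gsutil cp gs://my-gcs-bucket/aggregated.csv s3://my-s3-bucket/",
    shell=True,
)

# 3. COPY the CSV from S3 into Redshift.
conn = psycopg2.connect(host="redshift-host", port=5439, dbname="mydb",
                        user="myuser", password="mypassword")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY aggregated_table
        FROM 's3://my-s3-bucket/aggregated.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV;
    """)
conn.close()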

But while testing, I found that running gsutil directly from the Cloud Shell makes the file transfer much faster. It seems Amazon and Google have some dedicated data pipeline between them or something.

I am trying to move the transfer process to a Google Cloud Function that I will trigger with a request (in Python, but the language ultimately doesn't matter as long as it works). I tried both the subprocess and os modules, and neither worked. In general, shell commands don't seem to execute from the Python function.

This is the code of the Cloud Function; it works perfectly when run manually from the Cloud Shell:

import subprocess

def hello_world(request):
    data = subprocess.call('gsutil -m cp gs://bucket/file.csv s3://bucket/',shell=True)
    print(data)
    return 'Success!'

After transferring the file, I will set up an S3 trigger for a Lambda function that inserts the data into a Redshift table.
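What I have in mind for that Lambda is roughly this (connection details, table name and IAM role are placeholders, and psycopg2 would have to be bundled with the function or provided as a layer):

import psycopg2

# Placeholder connection string for the Redshift cluster.
REDSHIFT_DSN = "host=redshift-host port=5439 dbname=mydb user=myuser password=mypassword"


def handler(event, context):
    # The S3 trigger passes the bucket and key of the file that just arrived.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    conn = psycopg2.connect(REDSHIFT_DSN)
    with conn, conn.cursor() as cur:
        cur.execute(f"""
            COPY aggregated_table
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            CSV;
        """)
    conn.close()
    return "ok"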

I read this: How to Transfer data from Google cloud storage to S3 without any manual activity?

But the scheduler didn't make much sense to me; maybe I am doing something wrong with it. I seem to be able to make requests, but that won't solve the problem of the shell command not being executed.

If there are better alternatives, I am open to them. Is it worth looking into doing it the other way around, with a Lambda and gsutil on the AWS side?

Comments:
  • What does "it doesn't work" mean? Did it throw an exception, did it return an error, did you capture the stdout/stderr output from the subprocess, and did it tell you something useful? Aside from that, take a look at boto3 - you might be able to connect it to GCS as a data source and S3 as a data sink. Commented Aug 16, 2019 at 17:06
  • The function executed and nothing happened. The return value in the data field was 0: "Function execution started ... Function execution took 180 ms, finished with status code: 200". Commented Aug 19, 2019 at 7:03

2 Answers


I understand you want to build an automatic process that transfers aggregated data from BigQuery to Redshift.

Unfortunately, gsutil is not available in the Cloud Functions environment.

Cloud Functions imposes a timeout [1], with a maximum of 9 minutes. So even if gsutil were available in the Cloud Functions environment, copying large files could exceed the timeout.

If knowing which file changed is crucial to your case, an alternative would be to set up a Cloud Pub/Sub topic [2] and register object change notifications to it [3].

With an App Engine service or a Compute Engine VM subscribing to this Pub/Sub topic, it will receive an event whenever a file changes, and it can then synchronize the change to Amazon S3, either with Amazon's API [4] (in the App Engine case) or with gsutil rsync (in the Compute Engine case).
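As a rough sketch (not a production implementation), such a subscriber could look like the following, assuming the google-cloud-pubsub, google-cloud-storage and boto3 libraries and placeholder project, subscription and bucket names. It buffers each object in memory, so it only suits modest file sizes:

import boto3
from google.cloud import pubsub_v1, storage

# Placeholders - use your own project, subscription and bucket names.
PROJECT = "my-gcp-project"
SUBSCRIPTION = "gcs-object-changes"
S3_BUCKET = "my-s3-bucket"

gcs = storage.Client()
s3 = boto3.client("s3")


def sync_object(message):
    # GCS Pub/Sub notifications carry the bucket and object name as attributes.
    attrs = message.attributes
    if attrs.get("eventType") == "OBJECT_FINALIZE":
        bucket = attrs["bucketId"]
        name = attrs["objectId"]
        data = gcs.bucket(bucket).blob(name).download_as_bytes()
        s3.put_object(Bucket=S3_BUCKET, Key=name, Body=data)
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
subscriber.subscribe(path, callback=sync_object).result()  # block and process events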

If knowing which file changed is not crucial to your use case, setting up a cron job on a Compute Engine VM to run gsutil rsync can achieve the same goal.
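For example, a crontab entry like this (bucket names and log path are placeholders; gsutil needs AWS credentials in the VM's ~/.boto file) would synchronize the bucket to S3 every hour:

0 * * * * /usr/bin/gsutil -m rsync -r gs://my-gcs-bucket s3://my-s3-bucket >> /var/log/gcs-s3-sync.log 2>&1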

Please note that outgoing network bandwidth from Cloud Storage incurs costs [5]. You can set up a budget alert [6] to avoid surprise charges when synchronizing large files.

[1] https://cloud.google.com/functions/docs/concepts/exec#timeout

[2] https://cloud.google.com/storage/docs/pubsub-notifications

[3] https://cloud.google.com/storage/docs/reporting-changes

[4] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

[5] https://cloud.google.com/storage/pricing#network-egress

[6] https://cloud.google.com/billing/docs/how-to/budgets


1 Comment

I am marking this as the answer because it's mostly what I did in the end. I created a VM with Jenkins (on the AWS side), installed gsutil, and got a service account JSON at the organization level so it has access to all projects. From then on, the Jenkins job simply exports a table as CSV to GCS, runs gsutil on the VM, and puts the file on S3, where a Lambda trigger is waiting.

Google Cloud Functions is sandboxed and you can't execute shell commands. Moreover, you have no guarantee about whether gsutil is installed, or which version... It's the "beauty" of serverless!

However, there are two alternatives:

  1. Use the Google Python client libraries to perform your API calls (a rough sketch is shown after this list). If something isn't implemented, use the Discovery API. It seems complex but could work... Also have a look at the second solution.
  2. Have a look at Cloud Run. Package your container however you want, with the gcloud SDK installed, and you can run your process as is. I wrote an article on this; you can find in it the basics of turning a function into a container.
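For the first alternative, a minimal sketch of an HTTP-triggered Cloud Function that copies an object from GCS to S3 with the client libraries instead of shelling out to gsutil could look like this (bucket and object names are placeholders, AWS credentials have to be provided to boto3, for example through environment variables, and the whole object is held in memory):

import boto3
from google.cloud import storage

# Placeholder bucket names - adjust for your own setup.
GCS_BUCKET = "my-gcs-bucket"
S3_BUCKET = "my-s3-bucket"


def copy_to_s3(request):
    # Name of the object to copy, passed as ?object=... in the request.
    object_name = request.args.get("object", "file.csv")

    blob = storage.Client().bucket(GCS_BUCKET).blob(object_name)
    data = blob.download_as_bytes()

    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=object_name, Body=data)
    return "Copied {} to s3://{}/{}".format(object_name, S3_BUCKET, object_name)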

