I am trying to build an automated process that transfers aggregated data from BigQuery to Redshift. After a lot of reading, I found that the best way to do it is like this:
BigQuery -> export to Google Cloud Storage -> use gsutil to transfer the file to S3 -> COPY from the CSV into a Redshift table.
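For reference, the export step itself is just a standard BigQuery extract job to GCS, roughly like this (the project, dataset, table and bucket names below are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder identifiers
    table_id = "my-project.my_dataset.aggregated_results"
    destination_uri = "gs://my-gcs-bucket/file.csv"

    # Run an extract job that writes the table to GCS as CSV (the default format)
    extract_job = client.extract_table(table_id, destination_uri, location="US")
    extract_job.result()  # block until the export finishes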
I made it into a Python script and it all seems to work fine when run from my PC.
But while doing some tests I found that running gsutil directly from Cloud Shell makes the file transfer way faster. It seems Amazon and Google have some dedicated data pipeline between them or something.
I am trying to move the transfer process into a Google Cloud Function which I will trigger with an HTTP request (it is in Python, but the language ultimately doesn't matter as long as it works). I tried both the subprocess and os modules; neither worked. In general, shell commands don't seem to execute from the Python function.
This is the code of the Cloud Function; the same command works perfectly when run manually from Cloud Shell:
    import subprocess

    def hello_world(request):
        # Shell out to gsutil to copy the exported CSV from GCS to S3
        data = subprocess.call('gsutil -m cp gs://bucket/file.csv s3://bucket/', shell=True)
        print(data)  # return code of the gsutil process
        return 'Success!'
After the file is transferred, I will set up an S3 trigger for a Lambda function that loads it into a Redshift table.
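Roughly, that Lambda would look something like the sketch below (the cluster endpoint, credentials, IAM role and table names are placeholders, and psycopg2 has to be bundled with the deployment package):

    import psycopg2  # must be bundled with the Lambda deployment package

    def lambda_handler(event, context):
        # The S3 event tells us which object just landed
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        # Placeholder connection details
        conn = psycopg2.connect(
            host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
            port=5439,
            dbname="analytics",
            user="loader",
            password="********",
        )

        copy_sql = (
            "COPY my_schema.my_table "
            f"FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
            "CSV IGNOREHEADER 1;"
        )

        # The connection context manager wraps the COPY in a transaction and commits on success
        with conn, conn.cursor() as cur:
            cur.execute(copy_sql)
        conn.close()
        return "COPY issued"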
I read this: How to Transfer data from Google cloud storage to S3 without any manual activity?
But the scheduler didn't make much sense to me; maybe I am doing something wrong with it. I seem to be able to make requests, but that won't solve the problem of the shell command not being executed.
If there are better alternatives I am open to them. Is it worth looking into doing it the other way around, with a Lambda and gsutil on the AWS side?
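For example, would something along these lines be a reasonable direction? It is only a rough sketch (bucket and file names are placeholders) that replaces the gsutil call with the google-cloud-storage and boto3 client libraries:

    import boto3
    from google.cloud import storage

    def copy_gcs_to_s3(gcs_bucket, blob_name, s3_bucket):
        # Read the object from GCS into memory; fine for modest CSV exports,
        # but a very large file would need a streamed/multipart approach instead
        blob = storage.Client().bucket(gcs_bucket).blob(blob_name)
        data = blob.download_as_bytes()

        # Write it to S3 under the same key
        boto3.client("s3").put_object(Bucket=s3_bucket, Key=blob_name, Body=data)

    copy_gcs_to_s3("my-gcs-bucket", "file.csv", "my-s3-bucket")  # placeholder names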