Per these Amazon RDS docs, it looks like AWS offers an aws_s3 PostgreSQL extension for importing data from S3 into Postgres on RDS.
We're using Airflow to orchestrate our data ingestion pipelines, so a Python solution would be ideal. I have little experience with PostgreSQL and have never used any of its extensions, so being able to move data around with Python would help us a ton. For the time being, we're avoiding AWS tools such as AWS Data Pipeline and AWS Glue in favor of building our own architecture with Python and Airflow.
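From the docs, it looks like the extension exposes a SQL function (aws_s3.table_import_from_s3) that you run against the database, so presumably we could invoke it from Python with a driver like psycopg2. Below is a minimal sketch of what we have in mind, assuming the extension has already been created on the instance, the target table already exists, the instance's IAM role can read the bucket, and every connection/table/bucket name is a placeholder for our setup:
import psycopg2
# rough sketch only -- every name below is a placeholder for our setup
# assumes "CREATE EXTENSION aws_s3 CASCADE;" has been run on the RDS instance,
# the target table already exists, and the instance's IAM role can read the bucket
conn = psycopg2.connect(
    host="our-rds-instance.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="our_database",
    user="our_user",
    password="our_password",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT aws_s3.table_import_from_s3(
            'our_table',       -- existing target table
            '',                -- column list ('' = all columns)
            '(FORMAT csv)',    -- options handed through to COPY
            aws_commons.create_s3_uri('our_bucket', 'path-to-our/file.csv', 'us-east-1')
        );
        """
    )
    print(cur.fetchone())  # status string, e.g. how many rows were imported
conn.close()
Our understanding is that the options string is passed through to Postgres COPY, so CSV is the straightforward case; since our GCP pipeline uses newline-delimited JSON, we'd presumably have to convert to CSV or load each line into a jsonb column, because there's no schema autodetection like the BigQuery load job has.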
For reference, here is what we currently do on GCP to ingest data from GCS into BigQuery with Python:
from google.cloud import bigquery
# create BigQuery client object + load job config
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=None,  # let BigQuery autodetect the schema for now
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # we use ndjson
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append to existing table
    autodetect=True,
)
# and load into BigQuery
table_id = "our_gcp_project.our_model.our_table"
gcs_uri = "gs://our_bucket/path-to-our/file.json"
load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)  # make the API request; location="US" can also be passed
load_job.result() # Waits for the job to complete
# check for success
destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
We're essentially looking to port this code from GCS/BigQuery to S3/Postgres on RDS, and we want to make sure we're getting started in the right direction.
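For extra context, here is roughly how we imagine the Airflow side fitting together once the import call works. This is only a sketch: the connection id ("our_rds_conn"), DAG id, schedule, and table/bucket names are placeholders we made up, and it assumes the Postgres provider package is installed and that the aws_s3 call from the sketch above is what we end up using.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def load_s3_file_into_rds():
    # same aws_s3.table_import_from_s3 call as the sketch above,
    # but run through an Airflow connection ("our_rds_conn" is a placeholder)
    hook = PostgresHook(postgres_conn_id="our_rds_conn")
    result = hook.get_first(
        """
        SELECT aws_s3.table_import_from_s3(
            'our_table', '', '(FORMAT csv)',
            aws_commons.create_s3_uri('our_bucket', 'path-to-our/file.csv', 'us-east-1')
        );
        """
    )
    print(result)  # status string, e.g. how many rows were imported

with DAG(
    dag_id="s3_to_rds_ingest",  # placeholder DAG id and schedule
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_s3_file_into_rds",
        python_callable=load_s3_file_into_rds,
    )
We went with a PostgresHook in the sketch so the database credentials would live in an Airflow connection instead of in the DAG code, but we're open to a more standard pattern if one exists.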