
We need to design a process that efficiently imports large CSV files, created by upstream processes on a regular basis, into AlloyDB. We'd like to use Python for this task. What is the best practice in this case?

Some considerations:

  • Plain SQL INSERT statements are far less performant than a database-specific import tool like pg_restore.
  • While pg_restore can be executed remotely, I'd expect import performance for huge files to be significantly better when it is run locally on the DB server, because of the saved network round trips.
  • The AlloyDB documentation says: SSH into the DB server from a container, copy the file over from a GCS bucket to local storage, and run psql COPY / pg_restore. This is not a very convenient set of actions to perform programmatically.

We have a similar setup with a CloudSQL Postgres instance. In contrast to AlloyDB, CloudSQL offers a nice API that acts as an abstraction layer and handles the whole import of the file, which takes a lot of the burden off the developer.

Comments
  • Have a look at the AlloyDB REST API restore described in link1 & link2. Commented Apr 12, 2024 at 11:57
  • @SathiAiswarya: The documentation says "Creates a new Cluster in a given project and location, with a volume restored from the provided source, either a backup ID or a point-in-time and a source cluster." - I need to set up a regular ingest process into an existing cluster. Commented Apr 13, 2024 at 12:51
  • I think the best solution for big files is to use psycopg's copy_from or copy_expert methods directly. If you're using SQLAlchemy with psycopg, you can do it roughly as in the sketch below these comments. This encapsulates Postgres' COPY command. You trade some peak performance for ease of implementation, because the data is sent over the network instead of copying the file to the database server first and running the import locally. Still far better than using INSERT, though. Commented Apr 14, 2024 at 10:11
  • If you really need maximum performance, you'll most likely end up with a shell / batch script that first copies the files to the DB node and then runs the psql COPY command locally on the DB server. Commented Apr 14, 2024 at 10:16
  • Maybe you can post the same as an answer, so other members facing a similar issue are helped as well. Commented Apr 15, 2024 at 6:03
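For reference, here is a minimal sketch of the copy_expert route mentioned in the comment above, assuming psycopg2 behind SQLAlchemy; the connection string, CSV path and table name are placeholders:

```python
# Stream a local CSV into Postgres/AlloyDB with COPY via psycopg2's copy_expert,
# using a raw DBAPI connection obtained from SQLAlchemy. Names are placeholders.
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql+psycopg2://user:pass@host:5432/mydb")

def copy_csv(path: str, table: str) -> None:
    raw = engine.raw_connection()  # plain psycopg2 connection under the hood
    try:
        with raw.cursor() as cur, open(path, "r") as f:
            # COPY ... FROM STDIN reads the file object passed to copy_expert
            cur.copy_expert(
                f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)", f
            )
        raw.commit()
    finally:
        raw.close()

copy_csv("/data/upstream_export.csv", "my_table")
```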

1 Answer


First of all, there is (currently) no abstraction layer for importing files from Cloud Storage into AlloyDB similar to the one CloudSQL offers.

Of course you can import CSV files into AlloyDB from a bucket, but it may not be quite as comfortable. Depending on your needs, you can:

  1. For small files: create a script that reads in the CSV file, connects to the DB and issues SQL INSERT statements built from the file content. Use [PostgreSQL's multi-value INSERT syntax](https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-insert-multiple-rows/). Expect significantly worse performance compared to the options below. (A sketch of this approach follows the list.)

  2. For medium to large files using Python: some Python Postgres drivers support Postgres' COPY command, which is way faster than issuing INSERT statements. Examples: psycopg (v3) and asyncpg. The latter can also be used in conjunction with the [Google AlloyDB Python connector](https://cloud.google.com/alloydb/docs/connect-language-connectors) for ease and security of connectivity. Both can be used with SQLAlchemy. (A COPY sketch follows the list.)

  3. For huge files, if you need peak performance, you might want to copy the file to the DB server first and execute the Postgres COPY command using psql directly on the server. I'm not sure, though, whether the additional performance ever justifies the ugliness of this approach.
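A minimal sketch of option 1 (small files), assuming psycopg2 and its execute_values helper, which expands the multi-row VALUES list for you. The connection string, CSV path, table and column names are placeholders:

```python
# Read a small CSV and load it with multi-row INSERT ... VALUES statements,
# batched by psycopg2's execute_values helper. All names are placeholders.
import csv
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("host=... dbname=mydb user=... password=...")
try:
    with conn, conn.cursor() as cur:      # `with conn` commits on success
        with open("/data/small_export.csv", newline="") as f:
            reader = csv.reader(f)
            next(reader)                  # skip the header row
            rows = list(reader)           # small file, so load it fully
        execute_values(
            cur,
            "INSERT INTO my_table (col_a, col_b, col_c) VALUES %s",
            rows,
            page_size=1000,               # rows per multi-value statement
        )
finally:
    conn.close()
```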

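And a minimal sketch of option 2, assuming psycopg (v3) and a plain host/port connection; for AlloyDB you would typically route the connection through the AlloyDB Auth Proxy or the language connectors instead. The connection string, CSV path and table name are placeholders:

```python
# Stream a CSV into the database with Postgres' COPY using psycopg (v3).
# Connection details and names are placeholders.
import psycopg

CSV_PATH = "/data/upstream_export.csv"

# Leaving the connection block commits the transaction and closes the connection.
with psycopg.connect("host=... dbname=mydb user=... password=...") as conn:
    with conn.cursor() as cur:
        with open(CSV_PATH, "rb") as f:
            with cur.copy(
                "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)"
            ) as copy:
                while chunk := f.read(1 << 20):  # stream in 1 MiB chunks
                    copy.write(chunk)
```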