
I wanted to migrate from BigQuery to Cloud SQL to save cost. My problem is that Cloud SQL with PostgreSQL is very, very slow compared to BigQuery. A query that takes 1.5 seconds in BigQuery takes almost 4.5 minutes(!) on Cloud SQL with PostgreSQL.

I have a Cloud SQL for PostgreSQL server with the following configuration:

[screenshot: instance configuration]

My database has a main table with 16M rows (around 14 GB in RAM).

An example query:

EXPLAIN ANALYZE
SELECT
  "title"
FROM
  public.videos
WHERE
  EXISTS (SELECT
            *
          FROM (
                SELECT
                  COUNT(DISTINCT CASE WHEN LOWER(param) LIKE '%thriller%' THEN '0'
                                      WHEN LOWER(param) LIKE '%crime%' THEN '1' END) AS count
                FROM
                  UNNEST(categories) AS param
               ) alias
          WHERE count = 2)
ORDER BY views DESC
LIMIT 12 OFFSET 0

The table is a videos table with a categories column of type text[]. The search condition matches rows whose categories array contains both an element like '%thriller%' and an element like '%crime%', i.e. the distinct count over the two CASE buckets is exactly 2.

The EXPLAIN ANALYZE of this query gives this output (CSV): link. The EXPLAIN (BUFFERS) of this query gives this output (CSV): link.

Query Insights graph:

[screenshot: Query Insights graph]

Memory profile:

[screenshot: memory profile]

BigQuery reference for the same query on the same table size:

[screenshot: BigQuery timing]

Server config: link.

Table description: link.

My goal is to get the same query speed on Cloud SQL as on BigQuery.

  • Please show us the table definition, including the indexes. Is there any chance you could normalize your categories? Commented Nov 11, 2021 at 0:09
  • Sorry, I can't show it due to IP issues. No indexes. Commented Nov 11, 2021 at 0:12
  • Does this answer your question? stackoverflow.com/a/63585216/16531380 Commented Nov 11, 2021 at 8:42
  • You should turn on track_io_timing, then run EXPLAIN (ANALYZE, BUFFERS) and post the results as properly formatted to text (not as images), preserving indentation. Does BigQuery also have something equivalent to the execution plan you could show? Commented Nov 11, 2021 at 11:41
  • @RJC thanks for the link. My database is not that big, only 16M rows. I agree that BQ is suited for really big tables, but I don't think that's my case Commented Nov 11, 2021 at 17:19

6 Answers


For anyone coming here wondering how to tune their Postgres machine on Cloud SQL: these settings are called flags, and you can set them from the UI, although not all of the config options are editable.

https://cloud.google.com/sql/docs/postgres/flags#console
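
Once a flag has been applied, you can verify the live value from any SQL session; this is plain PostgreSQL, nothing Cloud SQL specific:

-- current value of a single setting
SHOW work_mem;

-- or several at once, with units
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('work_mem', 'shared_buffers', 'max_connections');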




The initial query looks overcomplicated. It could be rewritten as:

SELECT  v."title"
FROM public.videos v
WHERE array_to_string(v.categories, '^') ILIKE ALL (ARRAY['%thriller%', '%crime%'])
ORDER BY views DESC 
LIMIT 12 OFFSET 0;

db<>fiddle demo
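
If the rewritten query is still slow, a trigram index can make the ILIKE predicates indexable. A sketch, assuming the pg_trgm extension can be enabled; categories_text is a hypothetical wrapper, needed because array_to_string() itself is not IMMUTABLE and so cannot be used in an index expression directly:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- immutable wrapper so the expression can be indexed (hypothetical name)
CREATE FUNCTION categories_text(text[]) RETURNS text AS $$
  SELECT array_to_string($1, '^');
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX videos_cats_trgm_idx ON public.videos
  USING gin (categories_text(categories) gin_trgm_ops);

-- spelling the filter as ANDed ILIKEs lets the planner use the index
SELECT title
FROM public.videos
WHERE categories_text(categories) ILIKE '%thriller%'
  AND categories_text(categories) ILIKE '%crime%'
ORDER BY views DESC
LIMIT 12;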

1 Comment

Thanks for the answer. I have tried your solution on my db and it's still very, very slow: 2.5 minutes to finish. This is the EXPLAIN ANALYZE: ufile.io/ahk9qoem I want it to run in under 2 seconds like in BigQuery

PostgreSQL is very slow by design on every query involving the COUNT aggregate function, and there is absolutely nothing to do about it except use materialized views to recover the performance.

The tests I have made on my 48-core machine comparing COUNT performance between PostgreSQL and MS SQL Server are clear: SQL Server is between 61 and 561 times faster in all situations, and with a columnstore index SQL Server can be 1,533 times faster…

Similar speeds are reached by any other RDBMS. The explanation is clearly PostgreSQL's MVCC, which keeps ghost rows inside table and index pages and therefore has to visit every row to know whether it is active or a ghost... In all the other RDBMSs, the count is done by reading a single piece of information at the top of each page (the number of rows in the page), combined with parallelized access, or in SQL Server batch-mode access rather than row-by-row access...

There is nothing you can do to speed up COUNT in PostgreSQL until the storage engine is entirely rewritten to avoid ghost slots inside pages...
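
As a rough sketch of the materialized-view workaround mentioned above (assuming the videos table has an id column; the view just precomputes the flattened, lowercased category string so the per-row UNNEST and COUNT work disappears at query time):

CREATE MATERIALIZED VIEW videos_flat AS
SELECT id, title, views,
       LOWER(array_to_string(categories, '^')) AS cats
FROM public.videos;

CREATE INDEX ON videos_flat (views DESC);

-- the original filter becomes two plain LIKEs against the precomputed string
SELECT title
FROM videos_flat
WHERE cats LIKE '%thriller%'
  AND cats LIKE '%crime%'
ORDER BY views DESC
LIMIT 12;

-- re-run after bulk data changes
REFRESH MATERIALIZED VIEW videos_flat;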

2 Comments

Thanks for the answer, I tried MySQL and it's still very, very slow. I was using a JSON object because there is no array type in MySQL
Arrays and array-like structures (JSON, XML, records, multisets...) go against the basic principles of the relational model, which require data to be atomic. When data is atomic, access can be sped up with indexes. Except in specific databases like Oracle or Microsoft SQL Server, which can index XML, no other RDBMS knows how to speed up access with indexes on composite structures like ARRAY! By the way, a table is very, very close to an array and can be indexed...
0

I believe you need to use full-text search and a special GIN index. The steps:

  1. Create the helper function for the index:

     CREATE OR REPLACE FUNCTION immutable_array_to_string(text[]) RETURNS text AS $$
       SELECT array_to_string($1, ',');
     $$ LANGUAGE sql IMMUTABLE;

  2. Create the index itself (to_tsvector() already lowercases tokens, and the index expression must match the query expression exactly for the index to be used, so no LOWER() is needed):

     CREATE INDEX videos_cats_fts_idx ON videos
       USING gin (to_tsvector('english', immutable_array_to_string(categories)));

Use the following query:

SELECT title
FROM videos
WHERE to_tsvector('english', immutable_array_to_string(categories))
      @@ to_tsquery('english', 'thriller & crime')
LIMIT 12 OFFSET 0;

Be aware that this query means something different for 'crime' and 'thriller': they are no longer plain substrings but tokens in English text. That actually looks better suited to your task. Also, this index is not good for frequently changing data; it should work fine when your data is mostly read-only.
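
To check that the planner actually uses the GIN index, inspect the plan with the same expression the index was built on; it should show a Bitmap Index Scan on videos_cats_fts_idx rather than a sequential scan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT title
FROM videos
WHERE to_tsvector('english', immutable_array_to_string(categories))
      @@ to_tsquery('english', 'thriller & crime')
LIMIT 12;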

P.S. This answer was inspired by this answer and its comments: https://stackoverflow.com/a/29640418/159923



Apart from the SQL syntax optimization, have you tried tuning PostgreSQL itself?

Checking the EXPLAIN output, I found only two parallel workers planned and 25 kB of memory used for sorting:

Workers Planned: 2
Sort Method: quicksort  Memory: 25kB

Your query is a typical OLAP query. Its performance usually depends on memory and the number of CPU cores (workers) used. By default, Postgres uses kilobyte-scale memory settings and few workers. You can tune your postgresql.conf to make it work as an OLAP-type database.

Here is my suggestion: use more memory (about 9 MB of work_mem) and more CPUs (up to 16):

# DB Version: 13
# OS Type: linux
# DB Type: dw
# Total Memory (RAM): 24 GB
# CPUs num: 16
# Data Storage: ssd

max_connections = 40
shared_buffers = 6GB
effective_cache_size = 18GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 9830kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
max_parallel_maintenance_workers = 4

You can append these settings to the end of your postgresql.conf and restart your PostgreSQL server for them to take effect.

For further optimization:

  1. Reduce the number of connections and increase work_mem. 200 × 9830 kB is about 2 GB of memory across all connections; if you have fewer connections (for example, 100), each query gets more working memory.

Regarding the text[] column and UNNEST: you can try to add a proper index; see the sketch below.
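
For example, if exact category tags are acceptable instead of substring matches (note this changes the semantics of the original LIKE '%...%' query), a plain GIN index on the array supports the containment operator:

CREATE INDEX videos_categories_gin_idx ON public.videos USING gin (categories);

-- @> ("contains") can use the GIN index; substring LIKE cannot
SELECT title
FROM public.videos
WHERE categories @> ARRAY['thriller', 'crime']
ORDER BY views DESC
LIMIT 12;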

That's all, good luck.

WangYong

1 Comment

The question is about PostgreSQL on Google Cloud SQL - there is no direct access to postgresql.conf. effective_io_concurrency, max_parallel_workers_per_gather, max_parallel_workers, max_parallel_maintenance_workers settings are not made available through instance customisation flags.

You can modify the performance parameter settings under the Flags section. [screenshot]

