PostgreSQL - Query Optimization

Question

The query below takes about 15-20 seconds to run.

with cte0 as (
    SELECT
        label,
        date,
        CASE
            WHEN
                Lead(label || date || "number") OVER (PARTITION BY label || date || "number" ORDER BY "label", "date", "number", "time") IS NULL
            THEN
                '1'::numeric
            ELSE
                '0'::numeric
        END As "unique"
    FROM table_data
    LEFT JOIN table_mapper ON
        table_mapper."type" = table_data."type"
    WHERE Date BETWEEN date_trunc('month', current_date - 1) and current_date - 1
)
SELECT 'MTD' as "label", round(sum("unique") / count("unique") *100,1) as "value" FROM cte0 WHERE "date" BETWEEN date_trunc('month', current_date - 1) AND current_date -1
UNION ALL
SELECT 'Week' as "label", round(sum("unique") / count("unique") *100,1) as "value" FROM cte0 WHERE "date" BETWEEN date_trunc('week', current_date - 1) AND current_date -1
UNION ALL
SELECT 'FTD' as "label", round(sum("unique") / count("unique") *100,1) as "value" FROM cte0 WHERE "date" = current_date -1

The table table_data has an index on the date column.

CREATE INDEX ix_cli_date
  ON table_data
  USING btree
  (date);

Table Definition (`\d table_data`)

Table "public.table_data"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 date             | date                   | not null
 number           | bigint                 | not null
 time             | time without time zone | not null
 end time         | time without time zone | not null
 duration         | integer                | not null
 time1            | integer                | not null
 time2            | integer                | not null
 time3            | integer                | not null
 time4            | integer                | not null
 time5            | integer                | not null
 time6            | integer                | not null
 time7            | integer                | not null
 type             | text                   | not null
 name             | text                   | not null
 id1              | integer                | not null
 id2              | integer                | not null
 key              | integer                | not null
 status           | text                   | not null
Indexes:
    "ix_cli_date" btree (date)

Table Definition (`\d table_mapper`)

Table "public.table_mapper"
 Column | Type | Modifiers
--------+------+-----------
 type   | text | not null
 label  | text | not null
 label2 | text | not null
 label3 | text | not null
 label4 | text | not null
 label5 | text | not null

EXPLAIN ANALYZE of the query

Result  (cost=184342.66..230332.86 rows=3 width=64) (actual time=23377.923..25695.478 rows=3 loops=1)
  CTE cte0
    ->  WindowAgg  (cost=121516.06..156751.65 rows=612793 width=23) (actual time=14578.000..18985.958 rows=696157 loops=1)
          ->  Sort  (cost=121516.06..123048.04 rows=612793 width=23) (actual time=14577.975..17084.405 rows=696157 loops=1)
                Sort Key: (((table_mapper.label || (table_data.date)::text) || (table_data."number")::text)), table_mapper.label, table_data.date, table_data."number", table_data."time"
                Sort Method: external merge  Disk: 39480kB
                ->  Hash Left Join  (cost=11.96..37474.21 rows=612793 width=23) (actual time=1.449..3308.718 rows=696157 loops=1)
                      Hash Cond: (table_data."type" = table_mapper."type")
                      ->  Index Scan using ix_cli_date on table_data  (cost=0.02..29036.36 rows=612793 width=38) (actual time=0.141..946.648 rows=696157 loops=1)
                            Index Cond: ((date >= date_trunc('month'::text, ((('now'::text)::date - 1))::timestamp with time zone)) AND (date <= (('now'::text)::date - 1)))
                      ->  Hash  (cost=7.53..7.53 rows=353 width=25) (actual time=1.275..1.275 rows=336 loops=1)
                            Buckets: 1024  Batches: 1  Memory Usage: 15kB
                            ->  Seq Scan on table_mapper  (cost=0.00..7.53 rows=353 width=25) (actual time=0.020..0.589 rows=336 loops=1)
  ->  Append  (cost=27591.00..73581.21 rows=3 width=64) (actual time=23377.920..25695.467 rows=3 loops=1)
        ->  Aggregate  (cost=27591.00..27591.02 rows=1 width=32) (actual time=23377.917..23377.918 rows=1 loops=1)
              ->  CTE Scan on cte0  (cost=0.00..27575.68 rows=3064 width=32) (actual time=14578.052..22335.236 rows=696157 loops=1)
                    Filter: ((date <= (('now'::text)::date - 1)) AND (date >= date_trunc('month'::text, ((('now'::text)::date - 1))::timestamp with time zone)))
        ->  Aggregate  (cost=27591.00..27591.02 rows=1 width=32) (actual time=1741.509..1741.510 rows=1 loops=1)
              ->  CTE Scan on cte0  (cost=0.00..27575.68 rows=3064 width=32) (actual time=20.009..1522.352 rows=168261 loops=1)
                    Filter: ((date <= (('now'::text)::date - 1)) AND (date >= date_trunc('week'::text, ((('now'::text)::date - 1))::timestamp with time zone)))
        ->  Aggregate  (cost=18399.11..18399.13 rows=1 width=32) (actual time=576.029..576.030 rows=1 loops=1)
              ->  CTE Scan on cte0  (cost=0.00..18383.79 rows=3064 width=32) (actual time=9.308..546.735 rows=23486 loops=1)
                    Filter: (date = (('now'::text)::date - 1))
Total runtime: 25710.506 ms

Description:

I'm taking the unique count and the repeated count from table_data. This is where LEAD helped me out: I give the value 0 to the last repeated value of a column.

Suppose a column contains x three times. I give the value 1 to the first two x, and the third x is given 0.

Through a CTE I take all rows from table_data for a defined date range, concatenate the strings and apply lead, so that each row is assigned 1 or 0 according to the criteria.

If the lead is null, the row is counted as 1; if it is not null, it is counted as 0.

Then I return 3 rows (MTD, current week and FTD), each calculated from the sum() of the values derived from the lead and the count(*) of all rows.

For MTD, the sum and count cover the current month.

For Week, they cover the current week, and FTD is for yesterday.

You perform this string concatenation: Lead(label || date || "number") OVER (PARTITION BY label || date || "number" ORDER BY "label", ... only to detect the beginning of groups? Why not use the raw fields instead?
– joop, Commented Apr 28, 2014 at 11:02
I don't see a primary key in your table definition. Any other constraints or indexes missing? It's better to show what you get with \d table_data in psql instead of some hand-crafted surrogate.
– Erwin Brandstetter, Commented Apr 28, 2014 at 11:05
Actually I'm a newbie. Do I need to run \d table_data in the SQL editor? And I didn't create a primary key column, by the way.
– Unknown User, Commented Apr 28, 2014 at 11:07
psql is the default command line interface of PostgreSQL. Every table should have a primary key. Also, it's best not to use reserved words as identifiers.
– Erwin Brandstetter, Commented Apr 28, 2014 at 11:22
Good. What's missing is a description. Please add some explanation of what the query is supposed to achieve.
– Erwin Brandstetter, Commented Apr 28, 2014 at 11:27

Erwin Brandstetter · Accepted Answer · 2022-03-08 03:59:23Z

2

WITH cte AS (
   SELECT d.thedate
        , lead(m.label) OVER (PARTITION BY m.label, d.thedate, d.number
                              ORDER BY d.thetime) AS leader
   FROM   table_data d
   LEFT   JOIN table_mapper m USING (type)
   WHERE  thedate BETWEEN date_trunc('month', current_date - 1)
                  AND current_date - 1
   )

SELECT 'MTD' AS label, round(count(leader)::numeric / count(*) * 100, 1) AS val
FROM   cte

UNION ALL
SELECT 'Week', round(count(leader)::numeric / count(*) * 100, 1)
FROM   cte
WHERE  thedate BETWEEN date_trunc('week', current_date - 1) AND current_date - 1

UNION ALL
SELECT 'FTD', round(count(leader)::numeric / count(*) * 100, 1)
FROM   cte
WHERE  thedate = current_date - 1;

The CTE makes sense for big tables, so you only scan it once. For smaller tables it may be faster without ...

Using thedate instead of date, which is a reserved word in standard SQL. Likewise thetime and uni instead of time and unique, etc.

Simplified the lead() call. You get a value or NULL for the leading row. That seems to be the only relevant information.
It's a pointless waste to repeat columns from the PARTITION clause in the ORDER BY clause of a window function.
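
To see what the simplified lead() call produces, here is a minimal sketch on made-up rows (the VALUES list is an assumption for illustration, not data from the question). lead() returns NULL for the last row of each partition under the given ordering, so only that row ends up with a NULL leader:

```sql
-- Toy rows standing in for (label, thedate, number, thetime); all values are invented.
SELECT label, thedate, number, thetime
     , lead(label) OVER (PARTITION BY label, thedate, number
                         ORDER BY thetime) AS leader
FROM  (VALUES
         ('x', date '2014-04-27', 1, time '09:00')
       , ('x', date '2014-04-27', 1, time '10:00')
       , ('x', date '2014-04-27', 1, time '11:00')  -- last row of its partition: leader is NULL
       , ('y', date '2014-04-27', 2, time '09:30')  -- only row of its partition: leader is NULL
      ) AS t(label, thedate, number, thetime)
ORDER  BY label, thetime;
```

Sorting within a partition only needs thetime here, because label, thedate and number are constant inside each partition by definition.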

Building on that, count(leader) / count(*) instead of sum(uni) / count(uni) is a bit faster. count(column) only counts non-null values, while count(*) counts all rows.
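
The difference between the two count variants can be checked on a throwaway row set. This is only a sketch; the four rows are invented and the column name leader is borrowed from the query above:

```sql
-- Two non-null values and two NULLs.
SELECT count(leader) AS non_null_rows   -- counts only rows where leader IS NOT NULL
     , count(*)      AS all_rows        -- counts every row
     , round(count(leader)::numeric / count(*) * 100, 1) AS pct
FROM  (VALUES ('x'), ('x'), (NULL), (NULL)) AS t(leader);
```

The cast to numeric matters: both counts are bigint, and dividing them without the cast would be integer division.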

The condition for the first term of the UNION query was redundant.

More advice and links about data definition in the comments to the question.

Table design / Indexes

You should have primary keys. I suggest a serial or IDENTITY column as surrogate PK for table_data:

ALTER TABLE table_data ADD COLUMN table_data_id serial PRIMARY KEY;

See:

Auto increment table column
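
If you prefer the IDENTITY variant, a sketch of the equivalent statement might look like this. It assumes PostgreSQL 10 or later and is an alternative to the serial statement above, not something to run in addition to it:

```sql
-- Alternative to the serial column; requires PostgreSQL 10 or later.
-- The column name table_data_id is the same one used in the serial example.
ALTER TABLE table_data
  ADD COLUMN table_data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY;
```

Either way, existing rows are assigned generated values when the column is added, which rewrites the table.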

Make type the primary key of table_mapper (also needed for the following FK constraint):

ALTER TABLE table_mapper ADD CONSTRAINT table_mapper_pkey PRIMARY KEY (type);

Add a foreign key constraint for type to enforce referential integrity. Something like:

ALTER TABLE table_data ADD CONSTRAINT table_data_type_fkey
  FOREIGN KEY (type) REFERENCES table_mapper (type)
  ON UPDATE CASCADE ON DELETE NO ACTION;
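
Since the query uses a LEFT JOIN, table_data may contain type values with no match in table_mapper, and adding the foreign key would fail on those. A quick check before adding the constraint could look like this (a sketch, assuming the table and column names shown in the question):

```sql
-- List type values in table_data that have no matching row in table_mapper.
-- An empty result means the foreign key can be added without violations.
SELECT DISTINCT d.type
FROM   table_data d
LEFT   JOIN table_mapper m USING (type)
WHERE  m.type IS NULL;
```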

For ultimate read performance (at some cost for writes), add a multi-column index to possibly allow index-only scans for the query above:

CREATE INDEX table_data_foo_idx ON table_data (thedate, number, thetime);
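
Whether the planner actually uses the new index can be checked with EXPLAIN. The query below is a reduced sketch that reads only the three indexed columns, assuming the renamed columns from the answer:

```sql
-- Look for "Index Only Scan using table_data_foo_idx" in the resulting plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT thedate, number, thetime
FROM   table_data
WHERE  thedate BETWEEN date_trunc('month', current_date - 1)
                   AND current_date - 1;
```

An index-only scan requires every column the query reads to be in the index, and it only pays off when the visibility map is reasonably current (for example after a recent VACUUM). The full query also reads type for the join, so that column would presumably have to be part of the index as well for the full query to qualify.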

edited Mar 8, 2022 at 3:59

answered Apr 28, 2014 at 12:21

Erwin Brandstetter



3 Comments

Unknown User Over a year ago

Do I need to rename the date column, or can I simply alias it, like "date" as "thedate"? And it did reduce the time to 12s.

Erwin Brandstetter Over a year ago

I suggest using any other name for the column that is not a reserved word (or basic type name): ALTER TABLE table_data RENAME "date" TO my_new_column_name; Details in the manual.

Unknown User Over a year ago

Thanks a lot. I'll do it.

Gordon Linoff · Accepted Answer · 2014-04-28 11:36:57Z

1

As your query is written, you are referring to the CTE three times. Instead, you can use conditional aggregation if you are willing to have the values in three columns rather than three rows:

SELECT round(sum(CASE WHEN "date" BETWEEN date_trunc('month', current_date - 1) AND current_date - 1 THEN "unique" ELSE 0 END) /
             sum(CASE WHEN "date" BETWEEN date_trunc('month', current_date - 1) AND current_date - 1 THEN 1 ELSE 0 END) * 100, 1) AS mtd
     . . .
FROM CTE

This may speed up the query. In addition, you could then incorporate this logic into the CTE query itself, eliminating the materialization step as well.
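
The conditional-aggregation idea can be sketched in full on a toy stand-in for the CTE. The three rows below are invented and all fall on yesterday, so none of the denominators is zero; a real query might guard the divisions with nullif(). The column names "date" and "unique" follow the question:

```sql
WITH cte0("date", "unique") AS (   -- toy stand-in for the question's CTE
   VALUES (current_date - 1, 1::numeric)
        , (current_date - 1, 0::numeric)
        , (current_date - 1, 1::numeric)
   )
SELECT round(sum("unique") / count(*) * 100, 1) AS mtd   -- the CTE already filters to the month
     , round(sum(CASE WHEN "date" >= date_trunc('week', current_date - 1) THEN "unique" ELSE 0 END)
           / sum(CASE WHEN "date" >= date_trunc('week', current_date - 1) THEN 1 ELSE 0 END) * 100, 1) AS week
     , round(sum(CASE WHEN "date" = current_date - 1 THEN "unique" ELSE 0 END)
           / sum(CASE WHEN "date" = current_date - 1 THEN 1 ELSE 0 END) * 100, 1) AS ftd
FROM   cte0;
```

This reads the CTE once and returns one row with three columns instead of three rows. The upper bound current_date - 1 is left out of the week condition because the CTE in the question already enforces it.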

answered Apr 28, 2014 at 11:36

Gordon Linoff


1 Comment

Unknown User Over a year ago

I tried the method from your answer, but when I run the query I get the error: syntax error at or near "then".
