PostgreSQL: select count and maximum from two tables

Question

I've got two tables linked by a common ID column, defined like this:

CREATE TABLE IF NOT EXISTS names (
    uid BIGSERIAL,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (uid)
);
CREATE TABLE IF NOT EXISTS texts (
    name_uid BIGINT NOT NULL REFERENCES names,
    timestamp TIMESTAMP NOT NULL,
    some_value TEXT NULL
);

And here is some sample data to play around with:

INSERT INTO names VALUES ( 0, '1/a' );
INSERT INTO names VALUES ( 1, '1/b' );
INSERT INTO names VALUES ( 2, '2/c' );
INSERT INTO names VALUES ( 3, '3/d' );
INSERT INTO names VALUES ( 4, '3/e' );
INSERT INTO names VALUES ( 5, '3/f' );
INSERT INTO texts VALUES ( 0, '2018-01-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 1, '2018-01-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 2, '2018-02-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 2, '2018-02-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 3, '2018-03-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 3, '2018-06-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 4, '2018-06-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 5, '2018-06-03 00:00:00', 'text...' );

What I need now is to apply the following rules:

Select names.uid and names.name based on a SIMILAR TO pattern on the column name in the table names, and group them by their prefix.
For the selected rows from names: get the newest timestamp entry from texts (regardless of when it was).
For the selected rows from names: count, per name prefix, the corresponding rows in the table texts that are after a specific date.

This can be achieved with the following query:

SELECT substring(names.name, '[^/]+' ) AS name_prefix, COALESCE( sum( text_counts.count ), 0) AS counter, max(text_timestamps.timestamp) AS timestamp
FROM names
LEFT JOIN (
    SELECT texts.name_uid, count(*)
    FROM texts
    WHERE texts.timestamp > '2018-05-01 00:00:00'
    GROUP BY texts.name_uid
) text_counts ON text_counts.name_uid = names.uid
LEFT JOIN(
    SELECT texts.name_uid, max(texts.timestamp) AS timestamp
    FROM texts
    GROUP BY texts.name_uid
) text_timestamps ON text_timestamps.name_uid = names.uid
WHERE names.name SIMILAR TO '1%|3%'
GROUP BY name_prefix

However, this query is quite slow. So I tried to come up with a better solution, but have failed so far. What I've got is this:

SELECT name_info.name_prefix, count(*) AS counter, max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+' ) AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
) name_info ON name_info.uid = texts.name_uid
WHERE texts.timestamp > '2018-05-01 00:00:00'
GROUP BY name_info.name_prefix

Compared to the first solution, this is very fast. The problem is that rows with a count of zero are now missing from the result.

My question is how to craft a query that offers performance close to that of query 2 but includes the rows with a count of zero in the result.

Some contextual information: I'm working with PostgreSQL 10 and the table texts has about a million times more rows than the table names. In fact, texts is even partitioned in the real world, but I decided to skip this for the example here.

klin · Accepted Answer · 2018-06-18 10:51:34Z


The right join in the second query acts like an inner join because of the timestamp condition in the WHERE clause. Remove the condition and use the count(*) aggregate with FILTER:

SELECT 
    name_info.name_prefix, 
    count(*) FILTER (WHERE texts.timestamp > '2018-05-01 00:00:00') AS counter, 
    max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+' ) AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
    ) name_info ON name_info.uid = texts.name_uid 
GROUP BY name_info.name_prefix;

DbFiddle.
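To see which rows the WHERE condition was discarding, it can help to look at the joined rows before any grouping. The following is only a diagnostic sketch against the sample tables from the question; the aliases n and t and the column passes_filter are introduced here for illustration.

```sql
-- One output row per (name, text) pair, before any aggregation.
-- passes_filter is false for texts on or before the cutoff and NULL
-- for a name that has no texts at all. A WHERE condition on
-- t.timestamp keeps only the rows where it is true, so a prefix whose
-- rows all fail the test vanishes from the grouped result.
SELECT n.uid,
       n.name,
       t.timestamp,
       (t.timestamp > '2018-05-01 00:00:00') AS passes_filter
FROM names AS n
LEFT JOIN texts AS t ON t.name_uid = n.uid
WHERE n.name SIMILAR TO '1%|3%'
ORDER BY n.uid, t.timestamp;
```

On the sample data, both rows for the prefix 1 (uid 0 and uid 1) have passes_filter = false, which is why that prefix disappeared from the second query. With FILTER those rows stay in the group, so max(timestamp) still sees them and the count comes out as 0.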

You can also try two-stage grouping, e.g.:

select 
    name_prefix, 
    sum(counter) as counter, 
    max(timestamp) as timestamp
from (
    select 
        substring(name, '[^/]+' ) as name_prefix,
        sum((timestamp > '2018-05-01 00:00:00')::int) as counter,
        max(timestamp) as timestamp
    from texts
    join names on name_uid = uid
    where name similar to '1%|3%'
    group by uid
    ) s
group by name_prefix
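Note that the two-stage query above uses an inner join, so a name without any rows in texts would not appear in the result. If that case can occur in your data, one possible variant (a sketch under that assumption, not tested against the real tables) starts from names with a left join and counts with FILTER; the aliases n, t and s are only for illustration.

```sql
SELECT name_prefix,
       sum(counter)   AS counter,
       max(timestamp) AS timestamp
FROM (
    -- Stage 1: one row per name. uid is the primary key of names,
    -- so grouping by it allows selecting n.name as well.
    SELECT n.uid,
           substring(n.name, '[^/]+') AS name_prefix,
           -- For a name without texts, t.timestamp is NULL, the
           -- FILTER condition is not true, and the count is 0.
           count(*) FILTER (WHERE t.timestamp > '2018-05-01 00:00:00') AS counter,
           max(t.timestamp) AS timestamp
    FROM names AS n
    LEFT JOIN texts AS t ON t.name_uid = n.uid
    WHERE n.name SIMILAR TO '1%|3%'
    GROUP BY n.uid
) AS s
-- Stage 2: collapse the per-name rows into one row per prefix.
GROUP BY name_prefix;
```

On the sample data from the question this should return the prefix 1 with a counter of 0 and the timestamp 2018-01-02 00:00:00, and the prefix 3 with a counter of 3 and the timestamp 2018-06-03 00:00:00.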

edited Jun 18, 2018 at 10:51

answered Jun 14, 2018 at 19:54

klin


2 Comments

user711270 Over a year ago

Now I see my issue with the WHERE clause. Thanks for pointing that out. On the DbFiddle this gets a query cost of about 47 compared to the 74 of my original attempt. Weirdly, on the real-world data your solution gets a query cost of about 5 million compared to about 3 million for my original query. I have to investigate this further. Sadly, I cannot share the real-world data for confidentiality reasons.

klin Over a year ago

Of course, large data tables may have dependencies that affect query performance, which cannot be found on sample data. See the updated answer with an alternative solution.
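The cost figures discussed in these comments are planner estimates in arbitrary units, so they do not always track actual run time. One way to compare the candidate queries on the real tables is to look at measured timings instead. The sketch below assumes the schema from the question; note that EXPLAIN with ANALYZE actually executes the query.

```sql
-- Runs the query and reports the actual time and row counts per plan
-- node, plus buffer usage, next to the planner's estimates.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name_info.name_prefix,
       count(*) FILTER (WHERE texts.timestamp > '2018-05-01 00:00:00') AS counter,
       max(texts.timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+') AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
) name_info ON name_info.uid = texts.name_uid
GROUP BY name_info.name_prefix;
```

Running the same statement for each candidate query and comparing the reported execution times, rather than the estimated costs, may explain the discrepancy between the fiddle and the production data.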
