I've got two tables linked by a common ID column, like this:

CREATE TABLE IF NOT EXISTS names (
    uid BIGSERIAL,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (uid)
);
CREATE TABLE IF NOT EXISTS texts (
    name_uid BIGINT NOT NULL REFERENCES names,
    timestamp TIMESTAMP NOT NULL,
    some_value TEXT NULL
);

And here is some sample data to play around with:

INSERT INTO names VALUES ( 0, '1/a' );
INSERT INTO names VALUES ( 1, '1/b' );
INSERT INTO names VALUES ( 2, '2/c' );
INSERT INTO names VALUES ( 3, '3/d' );
INSERT INTO names VALUES ( 4, '3/e' );
INSERT INTO names VALUES ( 5, '3/f' );
INSERT INTO texts VALUES ( 0, '2018-01-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 1, '2018-01-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 2, '2018-02-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 2, '2018-02-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 3, '2018-03-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 3, '2018-06-01 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 4, '2018-06-02 00:00:00', 'text...' );
INSERT INTO texts VALUES ( 5, '2018-06-03 00:00:00', 'text...' );

What I need now is to apply the following logical rules:

  • select names.uid and names.name based on a SIMILAR TO pattern on the column name in the table names, and group them by their prefix
  • for the selected rows from names: get the newest timestamp entry from texts (regardless of when it was)
  • for the selected rows from names: count, per name prefix, the corresponding rows in the table texts that are after a specific date

This can be achieved with the following query:

SELECT substring(names.name, '[^/]+' ) AS name_prefix, COALESCE( sum( text_counts.count ), 0) AS counter, max(text_timestamps.timestamp) AS timestamp
FROM names
LEFT JOIN (
    SELECT texts.name_uid, count(*)
    FROM texts
    WHERE texts.timestamp > '2018-05-01 00:00:00'
    GROUP BY texts.name_uid
) text_counts ON text_counts.name_uid = names.uid
LEFT JOIN(
    SELECT texts.name_uid, max(texts.timestamp) AS timestamp
    FROM texts
    GROUP BY texts.name_uid
) text_timestamps ON text_timestamps.name_uid = names.uid
WHERE names.name SIMILAR TO '1%|3%'
GROUP BY name_prefix

However, this query is quite slow. So I tried to come up with a better solution, but have failed so far. What I've got is this:

SELECT name_info.name_prefix, count(*) AS counter, max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+' ) AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
) name_info ON name_info.uid = texts.name_uid
WHERE texts.timestamp > '2018-05-01 00:00:00'
GROUP BY name_info.name_prefix

Compared to the first solution, this is very fast. The problem is that rows with a count of zero are now missing from the result.

My question now is how to craft a query that offers performance close to that of query 2 but includes the rows with a count of zero in the result.

Some contextual information: I'm working with PostgreSQL 10 and the table texts has about a million times more rows than the table names. In fact, texts is even partitioned in the real world, but I decided to skip this for the example here.

1 Answer

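The right join in the second query acts like an inner join because of the timestamp condition in the WHERE clause: the condition is evaluated after the join, so it rejects the NULL-extended rows along with every texts row older than the given date. Remove the condition and use the count(*) aggregate with FILTER: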

SELECT 
    name_info.name_prefix, 
    count(*) FILTER (WHERE texts.timestamp > '2018-05-01 00:00:00') AS counter, 
    max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+' ) AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
    ) name_info ON name_info.uid = texts.name_uid 
GROUP BY name_info.name_prefix;

DbFiddle.
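
For comparison, the same logic can be written from the names side with a LEFT JOIN. This is only a sketch, assuming the sample tables and rows from the question; the aliases n and t are introduced here for illustration. The WHERE clause touches only names, so the outer join is not cancelled, and the date condition lives in FILTER, so older rows still contribute to max().

```sql
-- Sketch only: same logic as the RIGHT JOIN query, written from the names side.
-- Assumes the sample tables and rows from the question.
SELECT
    substring(n.name, '[^/]+') AS name_prefix,
    -- rows that fail the FILTER are not counted, but still take part in max()
    count(*) FILTER (WHERE t.timestamp > '2018-05-01 00:00:00') AS counter,
    max(t.timestamp) AS timestamp
FROM names n
LEFT JOIN texts t ON t.name_uid = n.uid
WHERE n.name SIMILAR TO '1%|3%'   -- restricts names only, so the outer join survives
GROUP BY name_prefix
ORDER BY name_prefix;

-- Expected on the sample data:
--  name_prefix | counter |      timestamp
--  1           |       0 | 2018-01-02 00:00:00
--  3           |       3 | 2018-06-03 00:00:00
```

Moving the date condition into the ON clause instead would also keep the zero-count rows, but max() would then see only rows after the date, which contradicts the "newest timestamp regardless of when it was" rule.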

You can also try two-stage grouping, e.g.:

select 
    name_prefix, 
    sum(counter) as counter, 
    max(timestamp) as timestamp
from (
    select 
        substring(name, '[^/]+' ) as name_prefix,
        sum((timestamp > '2018-05-01 00:00:00')::int) as counter,
        max(timestamp) as timestamp
    from texts
    join names on name_uid = uid
    where name similar to '1%|3%'
    group by uid
    ) s
group by name_prefix
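
Note that the inner join in the two-stage query drops any name that has no rows in texts at all. The sample data contains no such name, so whether this matters depends on the real data. If those names should appear with a count of zero, a variant along the following lines might work. It is a sketch under that assumption, not a tested replacement; the aliases n and t are illustrative.

```sql
-- Sketch only: two-stage grouping that also keeps names without any texts rows.
-- Assumes the tables from the question.
SELECT
    name_prefix,
    sum(counter) AS counter,
    max(timestamp) AS timestamp
FROM (
    SELECT
        substring(n.name, '[^/]+') AS name_prefix,
        -- count(*) FILTER yields 0 (not NULL) when no row qualifies
        count(*) FILTER (WHERE t.timestamp > '2018-05-01 00:00:00') AS counter,
        max(t.timestamp) AS timestamp
    FROM names n
    LEFT JOIN texts t ON t.name_uid = n.uid
    WHERE n.name SIMILAR TO '1%|3%'
    GROUP BY n.uid   -- uid is the primary key, so n.name may appear in the select list
) s
GROUP BY name_prefix;
```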

2 Comments

Now I see my issue with the WHERE clause. Thanks for pointing that out. On the DbFiddle this gets a query cost of about 47, compared to 74 for my original attempt. Weirdly, on the real-world data your solution gets a query cost of about 5 million, compared to about 3 million for my original query. I have to investigate this further. Sadly, I cannot share the real-world data for confidentiality reasons.
Of course, large data tables may have dependencies that affect query performance, which cannot be found on sample data. See the updated answer with an alternative solution.
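
The costs quoted in the comments are planner estimates. One way to investigate the difference on the real data is to compare them with actual run times. A minimal sketch, assuming the FILTER query from the answer; be aware that EXPLAIN with ANALYZE really executes the statement:

```sql
-- Sketch only: shows estimated cost next to actual time and buffer usage.
-- ANALYZE runs the query for real, so expect the full run time on large tables.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    name_info.name_prefix,
    count(*) FILTER (WHERE texts.timestamp > '2018-05-01 00:00:00') AS counter,
    max(timestamp) AS timestamp
FROM texts
RIGHT JOIN (
    SELECT names.uid, substring(names.name, '[^/]+') AS name_prefix
    FROM names
    WHERE names.name SIMILAR TO '1%|3%'
) name_info ON name_info.uid = texts.name_uid
GROUP BY name_info.name_prefix;
```

Running the same statement for each candidate query makes it possible to see where estimated and actual row counts diverge.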
