Statistical Mode with postgres

Question

I have a table that has this schema:

create table mytable (creation_date timestamp,
                      value int,
                      category int);

I want the maximum ocurrence of a value every each hour per category, Only on week days. I had made some progress, I have a query like this now:

select category,foo.h as h,value, count(value) from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value;

as result I got something like this:

category |    h     |  value  | count 
---------+----------+---------+-------
       1 | 00:00:00 |       2 |     1
       1 | 01:00:00 |       2 |     1
       1 | 02:00:00 |       2 |     6
       1 | 03:00:00 |       2 |    31
       1 | 03:00:00 |       3 |    11
       1 | 04:00:00 |       2 |    21
       1 | 04:00:00 |       3 |     9
       1 | 13:00:00 |       1 |    14
       1 | 14:00:00 |       1 |    10
       1 | 14:00:00 |       2 |     7
       1 | 15:00:00 |       1 |    52

for example at 04:00 I have to values 2 and 3, with counts of 21 and 9 respectively, I only need the value with highest count which would be the statiscal mode.

BTW I have more than 2M records

What version of postgres are you using and what would you want to happen if there are two different values which have the same count? eg: if 1,04:00:00,3,11 was 1,04:00:00,3,21 instead? — Gary - Stand with Ukraine
– Gary - Stand with Ukraine, Commented Feb 3, 2012 at 19:42
PostgreSQL 8.4, if two values have the same count I need the lowest value. — Paco Valdez
– Paco Valdez, Commented Feb 3, 2012 at 19:46

Erwin Brandstetter · Accepted Answer · 2012-02-03 22:21:15Z

This can be simpler:

SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
       category
     , extract(hour FROM creation_date)::int AS h
     , count(*)::int AS max_ct
     , value
FROM   mytable
WHERE  extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP  BY 1,2,4
ORDER  BY 1,2,3 DESC;

Basically these are the steps:

Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract hour from timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer - we don't need bigint.
Order by category, h and the highest count (DESC).
Only pick the first row (highest count) per (category, h) with the according category.

I am able to do this in one query level, because DISTINCT is applied after the aggregate function.

The result will hold no rows for any (category, h) without no entries at all. If you need to fill in the blanks, LEFT JOIN to this:

SELECT c.category, h.h
FROM   cat_tbl c
CROSS  JOIN (SELECT generate_series(0, 23) AS h) h

Gary - Stand with Ukraine · Accepted Answer · 2012-02-03 20:27:11Z

1

Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.

Assuming you called the temporary table "summary_table", the following query should do it.

select
    category, h, value, count
from
    summary_table s1
where
    not exists 
    (select * from summary_table s2
        where s1.category = s2.category and
                  s1.h = s2.h and
                   (s1.count < s2.count 
            OR (s1.count = s2.count and s1.value > s2.value));

If you don't want to create a table, you could use a WITH clause to attach your query to this one.

with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value)
select
    category, h, value, count
from
    summary_table s1
where
    not exists 
    (select * from summary_table s2
        where s1.category = s1.category and
                  s1.h = s2.h and
                   (s1.count < s2.count 
            OR (s1.count = s2.count and s1.value > s2.value));

answered Feb 3, 2012 at 20:27

Gary - Stand with Ukraine

7,6291 gold badge33 silver badges41 bronze badges

1 Comment

Paco Valdez Over a year ago

Thanks, but the query didn't work, it worked if I put the second subquery in the from clause, like this:

from summary_table s1 (select * from summary_table) as s2 where where s1.category = s1.category and                   s1.h = s2.h and                    (s1.count < s2.count              OR (s1.count = s2.count and s1.value > s2.value));

Collectives™ on Stack Overflow

Statistical Mode with postgres

2 Answers 2

Basically these are the steps:

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Basically these are the steps:

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related