I have a table, which I will call data_rows, like this:

create table if not exists data_rows
(
    id                integer  not null,
    constraint data_rows_to_group
        primary key (id),
    date              date     not null,
    group_id          int, 
    --more fields that are not relevant
);

When I order the rows by date, I want a row to start a new group_id whenever its date is more than 7 days after the preceding row's date (the time span could be different, but let's keep it at 7 days). So, ordered by date, consecutive rows with the same group_id are never more than 7 days apart. For example:

id      date        group id
1      12.01.2019   0
2      15.01.2019   0
3      21.01.2019   0
4      05.02.2019   1
5      08.02.2019   1
6      20.02.2019   2
7      30.02.2019   3
8      30.02.2019   3

(Note that rows 1 and 3 are in the same group even though they are more than 7 days apart, because no two consecutive rows within that group are more than 7 days apart.)

I know how to do this procedurally in Python, C# or similar languages. But it would be very useful if I could do it on the PostgreSQL server, because it is a lot of data, it keeps everything to a single point of failure, and it would be a big learning experience too.

Here is how I would do it in C#, so you get the idea of what I want:

using System;
using System.Collections.Generic;
using System.Linq;

class DataRows
{
    public int Id { get; set; }
    public DateTime Date { get; set; }
    public int GroupId { get; set; }
}

class GroupMarking
{
    public DataRows[] RowsWithGroupIds(IEnumerable<DataRows> relevantDataRows, TimeSpan betweenSpan)
    {
        var currentGroupId = 0;
        var rows = relevantDataRows.OrderBy(p => p.Date).ToArray();
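        if (rows.Length == 0) return rows; // nothing to group
        rows[0].GroupId = currentGroupId;
        for (var i = 1; i < rows.Length; i++)
        {
            // Start a new group when the gap to the preceding row exceeds the span.
            if (rows[i].Date - rows[i - 1].Date > betweenSpan)
            {
                currentGroupId++;
            }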
            rows[i].GroupId = currentGroupId;
        }
        return rows;
    }
}

Is this possible in PostgreSQL? I know there are loops in Postgres. I would prefer a solution without loops, but if it is not possible without them, they are OK. How do I create the IDs in the group_id column without falling back on a procedural language?

  • You should try to do this at several iterations, on each iteration you will add one of your business rules to the SQL query. For example, first try to define a formula which will calculate the group ID from any given date - by using the provided date_span argument. Then you will try to add the rule for groups of no less than 2 rows. Then the next rule ... until you come up with a final SQL query. If it is impossible to build such a query - then you can simply write an imperative SQL procedure directly translating C# to SQL. Commented May 18, 2020 at 15:04
  • @IVOGELOV I tried to clarify the question a bit more. I added those extra rules as context, but they are not the problem. My problem is the grouping by date differences. Commented May 18, 2020 at 15:45
  • Well, you should start with a definition of this date_span - at least to prevent ambiguity like this: If there are 3 rows in sequence and row 2 is within the date_span relative to both row 1 and row 3 - then which of these 2 groups should we put row 2 in ? Implementation comes from the definition. Commented May 19, 2020 at 7:01
  • @IVOGELOV I changed the question a lot to make it clearer what I need. Commented May 19, 2020 at 8:11
  • Unrelated, but: 30.02.2019 is an invalid date Commented May 19, 2020 at 8:16

1 Answer

This can be solved by turning the condition "the difference to the previous row is greater than 7 days" into a flag, and then computing a running sum of that flag:

select id, "date", sum(flag) over (order by "date") as group_id
from (
  select id, "date", 
         ("date" - lag("date", 1, "date") over (order by "date") > 7)::int as flag
  from data_rows
) t
order by "date"       

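The expression "date" - lag("date", 1, "date") over (order by "date") calculates the difference in days between the "current" row and the previous one. The third argument to lag() is the default used when there is no previous row, so the first row is compared with its own date and gets a difference of 0. The difference is then checked against 7 days, and the resulting boolean is cast to an integer (0 or 1) so that the outer running sum can be applied to it. Each row whose gap exceeds 7 days adds 1 to the running sum, which is exactly what starts a new group.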
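To make the two steps easier to follow, here is a minimal Python sketch of the same idea, run on the sample rows from the question. It is only an illustration of the flag-then-running-sum logic, not something to run on the server. The helper name assign_group_ids and the max_gap_days parameter are made up for this example, and 30.02.2019 is replaced with 28.02.2019 because it is not a valid date.

```python
from datetime import date
from itertools import accumulate

# Sample rows from the question as (id, date) pairs.
# Assumption: the invalid 30.02.2019 is replaced with 28.02.2019.
rows = [
    (1, date(2019, 1, 12)),
    (2, date(2019, 1, 15)),
    (3, date(2019, 1, 21)),
    (4, date(2019, 2, 5)),
    (5, date(2019, 2, 8)),
    (6, date(2019, 2, 20)),
    (7, date(2019, 2, 28)),
    (8, date(2019, 2, 28)),
]


def assign_group_ids(rows, max_gap_days=7):
    """Return (id, date, group_id) tuples (hypothetical helper, for illustration)."""
    ordered = sorted(rows, key=lambda r: r[1])
    dates = [d for _, d in ordered]
    # Mirrors lag("date", 1, "date"): the previous date, or the row's own
    # date for the first row, so the first difference is 0.
    previous = dates[:1] + dates[:-1]
    # Mirrors ("date" - lag(...) > 7)::int
    flags = [int((cur - prev).days > max_gap_days)
             for cur, prev in zip(dates, previous)]
    # Mirrors sum(flag) over (order by "date"): a running sum of the flags.
    group_ids = accumulate(flags)
    return [(row_id, d, g) for (row_id, d), g in zip(ordered, group_ids)]


result = assign_group_ids(rows)
group_ids = [g for _, _, g in result]
print(group_ids)  # [0, 0, 0, 1, 1, 2, 3, 3]
```

The flags for the sample data are 0, 0, 0, 1, 0, 1, 1, 0, and their running sum gives the group ids shown in the question.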

Online example

(I replaced the invalid date 2019-02-30 with 2019-02-28)
