Best patterns for database table with git semantics?

Question

We have some medium-to-large tables (hundreds of thousands of rows) of data that we want users to be able to "fork", similar to Git, and potentially collaborate on over time. Users should be able to fork a table multiple times, make edits to the different forks, and compare various aggregate data between two forks.

Basically, this data starts out as a set of read-only tables, and we want to give each user a view of a table that contains their edits. The challenge is that we'd also like to query rows in this table (e.g. show only rows where column3 = 'left'), and possibly even join against another table on specific columns (e.g. an INNER JOIN where user_table.column3 = other_table.column10). In other words, we'd like to be able to treat these tables like fully materialized tables for relational operations.

The dumbest solution (which is what we do today) is simply to make full copies of the tables, but this is expensive, at least in our current incarnation: we are using PostgreSQL, and these copy operations can take 2-20 minutes. We'd like this to be a real-time operation, something with copy-on-write behavior.

We do record the changes the users make (i.e. a log of changes) so we can eventually apply them to the "original" table, but it would be nice to have a pattern, or in an ideal world a library or storage layer, that just does this for us.

We happen to use PostgreSQL and Python today but I'm open to NoSQL systems here, as I can imagine this could result in some pretty nasty SQL if this is generalized enough. Plus we're willing to sacrifice some relational capability in order to achieve the above. Are there known patterns and/or implementations in this space? Either in PostgreSQL, or in other storage systems? Turns out this is a really hard thing to google for.

Are we talking only about adding and editing their own stuff, or also about changing the original?
– Jakub Kania, Commented Jan 5, 2016 at 18:50
I'm not a database expert, but this seems pretty advanced. I feel like you would need a database designed specifically for such a use pattern. Normally databases are meant to represent "truth", so having multiple versions of the same table doesn't make sense in most cases.
– gardenhead, Commented Jan 5, 2016 at 19:25
Is this versioning of table data only, or does it include versions of the table structure?
– Clodoaldo Neto, Commented Jan 5, 2016 at 19:30
So I think ultimately we kind of don't care if the original is changed; what matters more is that you can query against derivative tables. Eventually you could apply changes back to the original table. Regarding data vs. structure: this is just data; the structure would remain constant.
– alecf, Commented Jan 6, 2016 at 16:42

Jakub Kania · Accepted Answer · 2016-01-11 22:57:50Z

It certainly is possible to do something like this, although the results may (or may not) be flaky depending on how many other hacks you're using in your database.

If we add the columns owner and deleted to the source table, create a view over it, add INSTEAD OF triggers to the view, and grant the users rights only to the view and not to the source table, we get this:

CREATE SEQUENCE source_seq;
CREATE TABLE source
(
    id INT DEFAULT nextval('source_seq')
    ,value VARCHAR
    ,owner name DEFAULT session_user
    ,deleted boolean DEFAULT FALSE
);
CREATE VIEW source_emp AS
    SELECT id, value
    FROM source AS s1
    WHERE ((owner IS NULL AND NOT EXISTS (SELECT * FROM source AS s2 WHERE s1.id = s2.id AND s2.owner = session_user )) OR owner = session_user)
    AND NOT deleted;
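-- SECURITY DEFINER lets users without rights on source write through the view;
-- session_user still identifies the connected user inside the function.
CREATE OR REPLACE FUNCTION source_change()
RETURNS TRIGGER
LANGUAGE plpgsql
SECURITY DEFINER
AS $function$
BEGIN
    IF TG_OP = 'UPDATE' THEN
        -- Copy-on-write: create the user's private copy if none exists yet.
        INSERT INTO source(id, value, owner, deleted)
        SELECT NEW.id, NEW.value, session_user, FALSE
        WHERE NOT EXISTS (SELECT * FROM source WHERE owner = session_user AND id = OLD.id);
        -- Apply the edit to the user's existing private copy.
        UPDATE source SET id = NEW.id, value = NEW.value, owner = session_user
        WHERE owner = session_user AND id = OLD.id;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        -- Insert a tombstone row if the user has no private copy yet.
        INSERT INTO source(id, value, owner, deleted)
        SELECT OLD.id, NULL, session_user, TRUE
        WHERE NOT EXISTS (SELECT * FROM source WHERE owner = session_user AND id = OLD.id);
        -- Mark the user's private copy as deleted.
        UPDATE source SET value = NULL, deleted = TRUE
        WHERE owner = session_user AND id = OLD.id;
        RETURN NULL;
    END IF;
    RETURN NEW;
END;
$function$;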

CREATE TRIGGER source_trig
    INSTEAD OF UPDATE OR DELETE ON
      source_emp FOR EACH ROW EXECUTE PROCEDURE source_change();

Now if the user tries to:

INSERT: the user gets records that are visible only to them through the view
UPDATE: the user gets a private copy of the original record and edits that, or updates their existing copy
DELETE: the record is marked as deleted for that user only; other users still see it
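
These three behaviours amount to a per-user overlay on top of shared rows. As a rough illustration of the semantics only (not of the PostgreSQL implementation), here is a minimal in-memory sketch in Python. The `Fork` class, its method names, and the explicit row ids are all hypothetical and do not come from the SQL above.

```python
class Fork:
    """Per-user overlay on a shared, read-only base (hypothetical illustration)."""

    _DELETED = object()  # tombstone: the row is hidden in this fork only

    def __init__(self, base):
        self.base = base    # shared rows as {id: value}; never mutated here
        self.overlay = {}   # this user's private copies and tombstones

    def insert(self, row_id, value):
        # A new row lives only in the overlay, so only this fork sees it.
        self.overlay[row_id] = value

    def update(self, row_id, value):
        # Copy-on-write: the edited value shadows the base row.
        if row_id not in self.rows():
            raise KeyError(row_id)
        self.overlay[row_id] = value

    def delete(self, row_id):
        # Record a tombstone instead of touching the base row.
        if row_id not in self.rows():
            raise KeyError(row_id)
        self.overlay[row_id] = self._DELETED

    def rows(self):
        # Overlay entries win over base rows; tombstones are filtered out.
        merged = {**self.base, **self.overlay}
        return {k: v for k, v in merged.items() if v is not self._DELETED}


base = {1: "left", 2: "right"}
alice, bob = Fork(base), Fork(base)
alice.update(1, "centre")
alice.delete(2)
alice.insert(3, "up")
```

Here `alice.rows()` reflects her edits, while `bob.rows()` and `base` are unaffected, which is the isolation the view and trigger provide per `session_user`.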

If you don't want to touch the original table, you can create a new table with those two additional columns and alter the view so that it unions with the original table.
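
A sketch of that variant, under some loudly stated assumptions: it uses SQLite through Python's standard `sqlite3` module so that it is self-contained, the `source_edits` table name and the `fork_rows` helper are hypothetical, and the owner is passed as a query parameter because SQLite has no `session_user`. In PostgreSQL the same SELECT could presumably be the body of the view, with `session_user` in place of `:owner`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Original data: never modified by users.
    CREATE TABLE source (id INTEGER PRIMARY KEY, value TEXT);
    -- Hypothetical overlay table holding each user's copies and tombstones.
    CREATE TABLE source_edits (
        id      INTEGER NOT NULL,
        value   TEXT,
        owner   TEXT NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (owner, id)
    );
    INSERT INTO source VALUES (1, 'left'), (2, 'right'), (3, 'left');
    INSERT INTO source_edits (id, value, owner, deleted) VALUES
        (1, 'right', 'alice', 0),   -- alice edited row 1
        (2, NULL,    'alice', 1),   -- alice deleted row 2
        (4, 'left',  'alice', 0);   -- alice added row 4
""")

# The user's own live rows, plus every original row they have not overridden.
FORK_QUERY = """
    SELECT id, value FROM source_edits
    WHERE owner = :owner AND NOT deleted
    UNION ALL
    SELECT s.id, s.value FROM source AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM source_edits AS e
        WHERE e.id = s.id AND e.owner = :owner
    )
    ORDER BY 1
"""


def fork_rows(conn, owner):
    """Return the rows of the table as seen by one owner's fork."""
    return conn.execute(FORK_QUERY, {"owner": owner}).fetchall()


alice_rows = fork_rows(conn, "alice")
bob_rows = fork_rows(conn, "bob")
```

With this layout the original table stays strictly read-only, and dropping a fork is just a matter of deleting that owner's rows from the overlay table.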

2 Comments

alecf Over a year ago

This is incredibly helpful, and gives me much greater appreciation for what PostgreSQL can do. (I'm not quite ready to mark this as "the" answer, I was hoping to get higher level patterns than this, but this is certainly pretty great!)

alecf Over a year ago

Well given enough time, it seems like this is the only practical answer right now.
