
Situation:

I have a PostgreSQL database that is logging data from sensors in a field-deployed unit (let's call this the source database). The unit has very limited hard-disk space, meaning that if left untouched, the data logging will fill up the disk where the database resides within a week. I have a (very limited) network link to the database (so I want to compress the dump file), and on the other side of said link I have another PostgreSQL database (let's call that the destination database) that has a lot of free space (let's just, for argument's sake, say that the source is very limited with regard to space, and the destination is unlimited).

I need to take incremental backups of the source database, append the rows that have been added since last backup to the destination database, and then clean out the added rows from the source database.

Now the source database might or might not have been cleaned since the last backup, so the destination database needs to import only the new rows in an automated (scripted) process, but pg_restore fails miserably when trying to restore from a dump that contains primary key values already present in the destination database.

So the question is:

What is the best way to restore only the rows from a source that are not already in the destination database?

The only solution I've come up with so far is to pg_dump the database and restore the dump into a secondary database on the destination side with pg_restore, then use simple SQL to sort out which rows already exist in my main destination database. But it seems like there should be a better way...
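For illustration, the "simple SQL" dedup step could look like this, assuming (for the sake of the sketch) that the dump was restored into a staging schema of the destination database, with a table topic keyed on topic_id — all placeholder names:

```shell
#!/bin/sh
# Hedged sketch of the dedup step: emit SQL that copies only the rows not
# yet present in the main table. "staging", "public.topic" and "topic_id"
# are placeholder names, not from the question.
dedup_sql()
{
cat <<'SQL'
INSERT INTO public.topic
SELECT s.*
FROM staging.topic AS s
WHERE NOT EXISTS (
    SELECT 1 FROM public.topic AS t WHERE t.topic_id = s.topic_id
);
SQL
}

# Typical use:  dedup_sql | psql -h localhost -U postgres destination_db
dedup_sql
```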

(extra question: Am I completely wrong in using PostgreSQL in such an application? I'm open to suggestions for other data-collection alternatives...)

  • What version of PostgreSQL are you using? Commented Nov 7, 2016 at 16:45
  • Can you connect (and execute commands) from the target database into the source database? Or do you have to batch from the source and ship (FTP, email) the transport files to the target? 2) Must the data model be exactly the same for source & target? (IMO: not) 3) How many tables are involved? Commented Nov 7, 2016 at 17:53
  • It is not clear what the limiting factor is in your case. Bandwidth? Connection time? BTW: I have two Raspberry Pis working as "buffering satellites" for doing web-scraping. Each has a 120G SSD mounted (enough for months of buffering). Data transfer is initiated from the "mother-station", once per 15 min or so. The mother station imports (into TEMP tables), dedups, and inserts the new records into their final tables. Deletes on the source machines are not yet implemented (but could be, using a high-watermark method). Commented Nov 8, 2016 at 0:38
  • @joop: 1) Yes, I can connect and execute commands directly, 2) I can define the source DB as I prefer, so mirroring is feasible, 3) two tables are involved Commented Nov 18, 2016 at 11:54
  • @wildplasser: Well, bandwidth will be limited, yes. Given the volatility of the link, connection time should also be kept to a minimum (I've noticed that pg_dump fails less than gracefully if the link goes down half-way through a dump). What do you mean by "high-watermark method"? Also, how do you handle the dedup? Commented Nov 18, 2016 at 11:54

3 Answers


A good way to start would probably be to use the --inserts option to pg_dump. From the documentation (emphasis mine):

Dump data as INSERT commands (rather than COPY). This will make restoration very slow; it is mainly useful for making dumps that can be loaded into non-PostgreSQL databases. However, since this option generates a separate command for each row, an error in reloading a row causes only that row to be lost rather than the entire table contents. Note that the restore might fail altogether if you have rearranged column order. The --column-inserts option is safe against column order changes, though even slower.

I don't have the means to test it right now with pg_restore, but this might be enough for your case.

You could also use the fact that since version 9.5, PostgreSQL provides ON CONFLICT DO ... for INSERT. Use a simple scripting language to add these to the dump and you should be fine. Unfortunately, I haven't found an option for pg_dump to add them automatically.
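A sketch of that scripting step (host, database, and table names are placeholders): the dump is piped through sed, which rewrites each single-line INSERT — the form --inserts normally emits.

```shell
#!/bin/sh
# Hedged sketch: append ON CONFLICT DO NOTHING (PostgreSQL 9.5+) to every
# INSERT statement in a pg_dump --inserts stream, so rows whose primary key
# already exists on the destination are silently skipped on restore.
# Assumes one INSERT per line (rows containing literal newlines would need
# a smarter rewrite).
add_on_conflict()
{
    sed 's/^\(INSERT INTO .*\);$/\1 ON CONFLICT DO NOTHING;/'
}

# Typical use, with placeholder names and gzip for the thin link:
#   pg_dump -h 192.168.0.101 -U postgres --data-only --inserts -t public.topic slurpert \
#     | add_on_conflict | gzip > /tmp/topic.sql.gz
#   gunzip -c /tmp/topic.sql.gz | psql -h localhost -U postgres mystuff
```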



You might google "sporadically connected database synchronization" to see related solutions.

It's not a neatly solved problem as far as I know - there are some common work-arounds, but I am not aware of a database-centric out-of-the-box solution.

The most common way of dealing with this is to use a message bus to move events between your machines. For instance, if your "source database" is just a data store, with no other logic, you might get rid of it, and use a message bus to say "event x has occurred", and point the endpoint of that message bus at your "destination machine", which then writes that to your database.

You might consider Apache ActiveMQ or read "Patterns of enterprise integration".


#!/bin/sh

PSQL=/opt/postgres-9.5/bin/psql

TARGET_HOST=localhost
TARGET_DB=mystuff
TARGET_SCHEMA_IMPORT=copied
TARGET_SCHEMA_FINAL=final

SOURCE_HOST=192.168.0.101
SOURCE_DB=slurpert
SOURCE_SCHEMA=public

########
create_local_stuff()
{
${PSQL} -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG0

CREATE SCHEMA IF NOT EXISTS  ${TARGET_SCHEMA_IMPORT};
CREATE SCHEMA IF NOT EXISTS  ${TARGET_SCHEMA_FINAL};
CREATE TABLE IF NOT EXISTS  ${TARGET_SCHEMA_FINAL}.topic
        ( topic_id INTEGER NOT NULL PRIMARY KEY
        , topic_date TIMESTAMP WITH TIME ZONE
        , topic_body text
        );

CREATE TABLE IF NOT EXISTS  ${TARGET_SCHEMA_IMPORT}.tmp_topic
        ( topic_id INTEGER NOT NULL PRIMARY KEY
        , topic_date TIMESTAMP WITH TIME ZONE
        , topic_body text
        );
OMG0
}
########
find_highest()
{
# COALESCE so that an empty staging table yields 0 rather than NULL
# (a NULL watermark would break the WHERE clause of the fetch query)
${PSQL} -q -t -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG1
SELECT COALESCE(MAX(topic_id), 0) FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic;
OMG1
}
########
fetch_new_data()
{
watermark=${1-0}

echo ${watermark}

${PSQL} -h ${SOURCE_HOST} -U postgres ${SOURCE_DB} <<OMG2

\COPY (SELECT topic_id, topic_date, topic_body FROM ${SOURCE_SCHEMA}.topic WHERE topic_id >${watermark}) TO '/tmp/topic.dat';
OMG2
}
########

insert_new_data()
{
${PSQL} -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG3

DELETE FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic WHERE 1=1;

COPY ${TARGET_SCHEMA_IMPORT}.tmp_topic(topic_id, topic_date, topic_body) FROM '/tmp/topic.dat';

INSERT INTO ${TARGET_SCHEMA_FINAL}.topic(topic_id, topic_date, topic_body)
SELECT topic_id, topic_date, topic_body
FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic src
WHERE NOT EXISTS (
        SELECT *
        FROM ${TARGET_SCHEMA_FINAL}.topic nx
        WHERE nx.topic_id = src.topic_id
        );
OMG3
}
########

delete_below_watermark()
{
watermark=${1-0}

echo ${watermark}

${PSQL} -h ${SOURCE_HOST} -U postgres ${SOURCE_DB} <<OMG4

-- delete not yet activated; COUNT(*) instead
-- DELETE
SELECT COUNT(*)
FROM ${SOURCE_SCHEMA}.topic WHERE topic_id <= ${watermark}
        ;
OMG4
}
######## Main

#create_local_stuff

watermark="`find_highest`"
echo 'Highest:' ${watermark}

fetch_new_data ${watermark}
insert_new_data

echo 'Delete below:' ${watermark}
delete_below_watermark ${watermark}

# Eof

This is just an example. Some notes:

  • I assume a non-decreasing serial PK for the table; in most cases it could also be a timestamp
  • for simplicity, all the queries are run as user postgres; you might need to change this
  • the watermark method guarantees that only new records are transmitted, minimising bandwidth usage
  • the method is atomic: if the script crashes, nothing is lost
  • only one table is fetched here, but you could add more
  • because I'm paranoid, I use a different name for the staging table and put it into a separate schema
  • the whole script does two queries on the remote machine (one to fetch, one to delete); you could combine these
  • but there is only one script (executing from the local = target machine) involved
  • the DELETE is not yet active; it only does a COUNT(*)
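On the note about combining the two remote queries: a hedged sketch of a single remote transaction that fetches new rows and deletes already-confirmed ones in one round trip (same placeholder names as the script above; the function only emits the psql commands, which would then be piped to psql).

```shell
#!/bin/sh
# Hedged sketch: emit a single psql script that, in one transaction,
# copies rows above the watermark to a local file and deletes rows at or
# below it (i.e. rows confirmed imported on a previous run). If the link
# drops mid-transfer, the transaction aborts and the DELETE rolls back.
combined_sql()
{
watermark=${1:-0}
cat <<SQL
BEGIN;
\\copy (SELECT topic_id, topic_date, topic_body FROM public.topic WHERE topic_id > ${watermark}) TO '/tmp/topic.dat'
DELETE FROM public.topic WHERE topic_id <= ${watermark};
COMMIT;
SQL
}

# Typical use:  combined_sql "$watermark" | psql -h 192.168.0.101 -U postgres slurpert
```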

