
I have a large table with about 10 million rows in a PostgreSQL 9.4 database. It looks somewhat like this:

gid | number1 | random | result | ...
 1  |    2    |  NULL  |  NULL  | ...
 2  |   15    |  NULL  |  NULL  | ...
... |   ...   |   ...  |  ...   | ...

Now I would like to update the columns random and result as a function of number1. At least random needs to be produced by a script outside of the database. Since I have limited RAM, I wonder how I could do that efficiently using psycopg2. I believe I face two problems: how to fetch the data without using too much RAM, and how to get the results back into the table. The naive approach would look like this:

curs.execute("""SELECT gid1, number1 FROM my_table;""")
data = curs.fetchall()

result = []
for i in data:
    result.append((create_random(i[1]), i[0]))
curs.executemany("""UPDATE my_table
                    SET random = %s
                    WHERE gid = %s;""",
                 results)
curs.execute("""UPDATE my_table
                SET result = number1 * random;""")

However, this will quickly deplete all my memory, and the row-by-row executemany UPDATE will take forever.
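
For the fetch side I could presumably use a named (server-side) cursor, so that rows are streamed in chunks instead of being materialised all at once by fetchall(). A minimal sketch, with connection parameters assumed:

import psycopg2

con = psycopg2.connect("dbname=mydb")   # connection parameters assumed
curs = con.cursor(name="gid_stream")    # giving the cursor a name makes it server-side
curs.itersize = 10000                   # rows transferred per network round trip
curs.execute("""SELECT gid, number1 FROM my_table;""")

for gid, number1 in curs:               # streamed chunk by chunk, never all in RAM
    pass                                # compute the new value for this row here

But that still leaves the problem of writing the results back efficiently.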

What would be a smarter strategy? The database is being accessed exclusively and can be locked. The PostgreSQL random function is unfortunately not suitable for my case.

4 Comments
  • Define “too much RAM”, because 10 million * 3 integers is 240MB, which is more than manageable by many systems. Also, could your generating logic be implemented as a stored procedure? Commented Oct 21, 2015 at 10:16
  • Can you use PL/Python (postgresql.org/docs/9.4/static/plpython.html)? If so you can write your own replacement for Postgres's random function and run it all as one UPDATE statement inside the database (a rough sketch of this idea follows after these comments). Commented Oct 21, 2015 at 10:40
  • @foz Oh - interesting! But I assume this wouldn't work in my case because I need numpy packages...? Commented Oct 21, 2015 at 11:38
  • @n1000 well I haven't tried it myself, but there is an example here that imports numpy inside a PL/Python function: github.com/ihuston/plpython_examples/blob/master/… Commented Oct 21, 2015 at 14:19
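
A rough sketch of the PL/Python idea from the comments, assuming the plpythonu language is available on the server, numpy is importable by the server's Python, and with my_random as a hypothetical function name wrapping placeholder logic:

import psycopg2

con = psycopg2.connect("dbname=mydb")   # connection parameters assumed
curs = con.cursor()

curs.execute("""
    CREATE OR REPLACE FUNCTION my_random(n integer)
    RETURNS double precision AS $$
        import numpy.random
        return numpy.random.random() * n   # placeholder for the real numpy logic
    $$ LANGUAGE plpythonu;
""")
curs.execute("""UPDATE my_table SET random = my_random(number1);""")
curs.execute("""UPDATE my_table SET result = number1 * random;""")
con.commit()

With this, every row stays inside the database and nothing has to be transferred to the client at all.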

1 Answer


unnest the data so the whole update happens in a single statement:

from random import random

def create_random(i):
    return random() * i

curs.execute("select gid, number from t;")
data = curs.fetchall()

results = []
for i in data:
    results.append((create_random(i[1]), i[0]))

curs.execute("""
    update t
    set
        rnd = s.rnd,
        result = number * s.rnd
    from unnest(%s) s(rnd numeric, gid integer)
    where t.gid = s.gid;
""", (results,))

con.commit()

Table t:

create table t (
    gid integer,
    number integer,
    rnd float,
    result float
);

2 Comments

Wow - the UPDATE is 500% faster than using executemany. Thanks! So you think fetching all into RAM and putting it back with unnest is OK otherwise? I'm just afraid I might run out of RAM for larger tables and more columns. No need for named cursors or something like that?
@n1000 It is OK as long as you think it is OK to take that client/server round trip. A PL/Python function would make it much simpler than dealing with cursors and client/server round trips.
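
If the table ever grows past what fits in client RAM, one possible way to keep this approach memory-bounded is to stream the rows through a named (server-side) cursor and run the unnest UPDATE once per batch. A sketch only, reusing the t table above and assuming connection parameters:

from random import random
import psycopg2

con = psycopg2.connect("dbname=mydb")        # connection parameters assumed
read_curs = con.cursor(name="stream_rows")   # named, so rows stay on the server
write_curs = con.cursor()

read_curs.execute("select gid, number from t;")
while True:
    rows = read_curs.fetchmany(50000)        # only one batch is held in memory
    if not rows:
        break
    results = [(random() * number, gid) for gid, number in rows]   # placeholder for create_random()
    write_curs.execute("""
        update t
        set
            rnd = s.rnd,
            result = number * s.rnd
        from unnest(%s) s(rnd numeric, gid integer)
        where t.gid = s.gid;
    """, (results,))

con.commit()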
