
I have the following code and I'm running it on some big data (about 2 hours of processing time). I'm looking into CUDA for GPU acceleration, but in the meantime, can anyone suggest ways to optimise this code?

It takes a 3D point from dataset 'T' and finds the minimum distance to the points in another dataset 'B'.

Is there any time saved by sending the results to a list first and then inserting them into the database table?

All suggestions welcome

    import psycopg2
    import scipy.spatial.distance

    conn = psycopg2.connect("<details>")
    cur = conn.cursor()

    for i in range(len(B)):
        i2 = i + 1
        point = B[i:i2]  # single point as a (1, 3) slice, since cdist expects 2D input
        # minimum euclidean distance from this point to every point in T
        disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
        print("Base: ", end='')
        print(i, end='')
        print(" of ", end='')
        print(len(B), end='')
        print(" ", end='')
        print(disti)

        cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                    (xi[i], yi[i], zi[i], disti))
        conn.commit()

    cur.close()

@@@@@@@@@@@@@@ EDIT @@@@@@@@@@@@@

Code update:

   conn = psycopg2.connect("dbname=kap_pointcloud host=localhost user=postgres password=Gnob2009")
    cur = conn.cursor()

    disti = []

    for i in range(len(T)):
        i2 = i + 1
        point = T[i:i2]
        disti.append(scipy.spatial.distance.cdist(point, B, metric='euclidean').min())
        print("Top: " + str(i) + " of " + str(len(T)))

Insert code to go here once I figure out the syntax

@@@@@@@@ EDIT @@@@@@@@

The solution, with a lot of help from Alex:

    from scipy.spatial.distance import cdist

    cur = conn.cursor()

    insert_params = []  # list for accumulating insert params
    for i in range(len(T)):
        XA = [T[i]]  # one point from T, wrapped so cdist gets 2D input
        disti = cdist(XA, B, metric='euclidean').min()
        insert_params.append((xi[i], yi[i], zi[i], disti))
        print("Top: " + str(i) + " of " + str(len(T)))

    # Only one instruction to insert everything
    cur.executemany("INSERT INTO pc_processing.pc_dist_top_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                    insert_params)
    conn.commit()

For timing comparison:

initial code took: 0:00:50.225644

without the multiline prints: 0:00:47.934012

with the commit taken out of the loop: 0:00:25.411207

I'm assuming the only way to make it faster is to get CUDA working?
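
Before reaching for CUDA, it may be worth checking how much of the remaining time is the Python loop itself: cdist accepts many points at once, so the per-point minimum becomes a single vectorised reduction. The full len(T) × len(B) distance matrix will not fit in memory for 80,000+ points, so the sketch below processes T in chunks (the chunk size is an arbitrary assumption to tune against available memory):

    import numpy as np
    from scipy.spatial.distance import cdist

    CHUNK = 1000  # assumed value; tune to available memory
    dists = np.empty(len(T))
    for start in range(0, len(T), CHUNK):
        stop = start + CHUNK
        # one cdist call per chunk instead of one call per point
        dists[start:stop] = cdist(T[start:stop], B, metric='euclidean').min(axis=1)

For a pure nearest-neighbour query, scipy.spatial.cKDTree(B).query(T) would avoid materialising any distance matrix at all, though that changes the algorithm rather than just the loop.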

  • I suppose you're overcommitting here. Can you: a) move conn.commit outside of the loop, b) prepare the data in the loop and then executemany from the prepared data, c) use bulk loading (COPY) from the prepared data?
  • And prints are not free. Remove them, or at least make 1 print instead of 7.
  • Thanks, moving the print lines to a single line saved 3 seconds for 8,000 records; seeing as I'm using 80,000+ records, that should be a 30+ second saving. But I suspect the real savings will be in the commit stage, when communicating with the database.
  • Redirecting to /dev/null would be even faster, BUT the real savings will be: a) committing larger chunks of work, b) executemany, c) BULK operations.
  • I've tried executemany: cur.executemany('insert into pc_processing.pc_dist_base_tmp(x) values (%s)', [(x,) for x in xi]) works, but if I add ...values (%s, %s)', [(x,) for x in xi], [(y,) for y in yi]) it complains about too many variables. (See the sketch after these comments.)
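
The "too many variables" error in the last comment comes from passing one list per column; executemany expects a single sequence of row tuples. A minimal sketch of the fix, assuming xi and yi are equal-length sequences (the two-column INSERT is just for illustration):

    # executemany wants ONE sequence of row tuples, not one sequence per column
    rows = list(zip(xi, yi))  # [(x0, y0), (x1, y1), ...]
    cur.executemany(
        "insert into pc_processing.pc_dist_base_tmp (x, y) values (%s, %s)",
        rows)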

3 Answers


There are 2 suggestions:

1) Do a single commit, or commit in chunks if len(B) is very large.

2) Prepare a list of the data you are inserting and do a bulk insert.

e.g.:

    insert into pc_processing.pc_dist_base_tmp (x, y, z, dist)
    select * from unnest(array[1, 2, 3], array[4, 5, 6], array[7, 8, 9], array[0.1, 0.2, 0.3]);

Comments

  • This works with your example, but I'm trying to pass arrays and it fails: cur.execute(""" insert into pc_processing.pc_dist_base_tmp(x, y, z, dist) select * from unnest(xi, yi, zi, disti); """). All the arrays are the same length, but it complains that the column xi does not exist, which is not the case. (See the sketch below.)
  • xi was a list object, so I did xi = np.asarray(xi), but the error persists.
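
For what it's worth, the "column xi does not exist" error above is most likely because the Python names xi, yi, zi, disti were written into the SQL string itself, so PostgreSQL parses them as column references. A sketch of one way around this, assuming the arrays hold plain floats (psycopg2 adapts Python lists to PostgreSQL arrays, but numpy arrays and numpy scalars need converting first):

    import numpy as np

    # .tolist() yields plain Python floats, which psycopg2 can adapt
    cur.execute(
        """insert into pc_processing.pc_dist_base_tmp (x, y, z, dist)
           select * from unnest(%s::float8[], %s::float8[], %s::float8[], %s::float8[])""",
        (np.asarray(xi).tolist(), np.asarray(yi).tolist(),
         np.asarray(zi).tolist(), np.asarray(disti).tolist()))
    conn.commit()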

OK. Let's accumulate all the suggestions from the comments.

Suggestion 1: commit as rarely as possible, and don't print at all

    conn = psycopg2.connect("<details>")
    cur = conn.cursor()

    for i in range(len(B)):
        i2 = i + 1
        point = B[i:i2]
        disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
        cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                    (xi[i], yi[i], zi[i], disti))

    conn.commit()  # note that you commit only once; be careful with *really* big chunks of data
    cur.close()

If you really need debug information inside your loops, use logging. You will be able to turn the logging output on and off as needed.
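
A minimal sketch of the logging idea: a debug-level message is skipped cheaply when the configured level is higher, so the progress output can stay in the code without costing anything in normal runs:

    import logging

    logging.basicConfig(level=logging.INFO)  # set to logging.DEBUG to see progress again
    log = logging.getLogger(__name__)

    for i in range(len(B)):
        disti = scipy.spatial.distance.cdist(B[i:i+1], T, metric='euclidean').min()
        # not formatted or emitted unless DEBUG is enabled
        log.debug("Base: %d of %d %f", i, len(B), disti)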

Suggestion 2: executemany to the rescue

    conn = psycopg2.connect("<details>")
    cur = conn.cursor()
    insert_params = []  # list for accumulating insert params

    for i in range(len(B)):
        i2 = i + 1
        point = B[i:i2]
        disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
        insert_params.append((xi[i], yi[i], zi[i], disti))

    # Only one instruction to insert everything
    cur.executemany("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                    insert_params)
    conn.commit()
    cur.close()
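
As a middle ground between executemany and full bulk loading, psycopg2 also ships psycopg2.extras.execute_values, which folds the rows into multi-row INSERT statements and is typically much faster than executemany. A sketch, reusing the insert_params list from above:

    from psycopg2.extras import execute_values

    # sends the rows as multi-row VALUES lists (page_size rows per statement)
    execute_values(cur,
                   "INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) VALUES %s",
                   insert_params)
    conn.commit()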

Suggestion 3: don't use INSERTs at all; use BULK operations

Instead of cur.execute and conn.commit, write the rows to a CSV file, and then load it with COPY.

The BULK solution should give the best performance, but it needs more effort to make it work.
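
Note that COPY does not actually require leaving psycopg2: cursor.copy_expert streams any file-like object through a COPY statement, so the CSV can even live in memory. A sketch, assuming insert_params holds the same (x, y, z, dist) tuples as above:

    import csv
    import io

    buf = io.StringIO()
    csv.writer(buf).writerows(insert_params)  # serialise the rows as CSV
    buf.seek(0)

    # stream the in-memory CSV straight into the table
    cur.copy_expert(
        "COPY pc_processing.pc_dist_base_tmp (x, y, z, dist) FROM STDIN WITH (FORMAT csv)",
        buf)
    conn.commit()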

Choose whichever is appropriate for you, depending on how much speed you need.

Good luck



Try committing when the loop is finished instead of on every single iteration.

