
I keep stumbling across large files (80,000+ lines) which I have to keep in my database. It takes 20-30 minutes to push them all to my MySQL database. I have a simple for loop which just loops over the whole CSV.

import csv
import MySQLdb

# open the connection to the MySQL server.
# using MySQLdb
mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()
with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    # execute and insert the csv into the database.
    for row in csv_reader:
        if "PVP_BIG" and "DATE_ADD" in row:
            print("First line removed")
        else:
            print("Not found!")
            sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            val = (row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
            cursor.execute(sql, val)
            print(row)

# close the connection to the database.
#mydb.commit()
cursor.close()
print("CSV has been imported into the database")

Is there any way I can divide the work, e.g. make it concurrent, so that it takes maybe 3-5 minutes depending on the hardware?

  • May be helpful. Try executemany. Commented Nov 17, 2020 at 16:40
  • Why don't you use LOAD DATA INFILE ...? Commented Nov 17, 2020 at 16:40
  • Also note that if "PVP_BIG" and "DATE_ADD" in row: will always eval to True Commented Nov 17, 2020 at 16:42
  • @SuperStew It won't. I tested it. If both of them are found in a row of the csv, then it is True; otherwise it is not. Commented Nov 17, 2020 at 16:44
  • OK, then you can use dev.mysql.com/doc/refman/8.0/en/mysqlimport.html, where you could specify --host=host_name --password[=password] etc. Commented Nov 17, 2020 at 16:51
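
The LOAD DATA LOCAL INFILE route suggested in the comments usually beats row-by-row inserts by a wide margin, since the server parses the file itself. A minimal sketch, assuming the 21-column semicolon-separated layout implied by the indices in the question (the @skip variables discard the unused columns, and local_infile must be enabled on both the client and the server):

import MySQLdb

# local_infile=1 is the client-side opt-in for LOCAL INFILE
mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db',
                       local_infile=1)
cursor = mydb.cursor()
# Map the 21 CSV columns onto the 16 table columns; @skipN discards a column.
# IGNORE 1 LINES drops the header row.
cursor.execute("""
    LOAD DATA LOCAL INFILE 'product_de.csv'
    INTO TABLE big
    FIELDS TERMINATED BY ';'
    IGNORE 1 LINES
    (SKU, Category, @skip2, Attribute1, Attribute2, Value1, Value2, @skip7,
     Brand, @skip9, Price, PVP_BIG, PVD, @skip13, @skip14, EAN13, WIDTH,
     HEIGHT, DEPTH, WEIGHT, Stock)
""")
mydb.commit()
cursor.close()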

1 Answer


First, you may get a big speedup by removing the print(row) from your inner loop. Everything else in the program waits on this action, and as an I/O action it can take much longer than you might think. Second, you might find a significant speedup by batching your INSERT statements, i.e. inserting more than one row at a time, say 100 or so. Third, the best way to do this is probably something involving asyncio, but I don't have much experience with it. You're likely I/O bound talking to the DB and reading the CSV file, and never doing both at once, so I'd go with a simple two-thread solution like the one below:

import csv
import MySQLdb
from threading import Thread
from queue import Queue


def row_insert_thread(q: Queue, cursor, mydb):
    while True:
        command = q.get()
        if command is None:
            # commit once at the end; MySQLdb does not autocommit by default
            mydb.commit()
            cursor.close()
            break
        cursor.execute(*command)

mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()

insert_q = Queue()

row_thread = Thread(target=row_insert_thread, args=(insert_q, cursor, mydb))
row_thread.start()

# build the statement once; only the values change per row
sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    next(csv_reader)  # skip the header row; I'm assuming there is only one
    for row in csv_reader:
        val = (row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
        insert_q.put((sql, val))

# signal the worker to commit and close, then wait for it to finish
insert_q.put(None)
row_thread.join()

print("CSV has been imported into the database")


For the multi-row insert statement, I'm not used to MySQL (I'm coming from SQLite experience here), but I think this will work:

def insert_multiple_rows(cursor, rows: list):
    # repeat the 16-column placeholder group once per row
    placeholders = "(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    sql = f"INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES {','.join([placeholders] * len(rows))}"
    # flatten the rows into one parameter list to match the placeholders
    args = [col for row in rows for col in row]
    cursor.execute(sql, args)

I expect you can integrate this into your code if you want to use it: just change the thread to take a list, then in the main loop append values to a list until it reaches whatever batch size you want (or you run out of rows), and put the list into insert_q.
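
As a rough sketch of that integration, reusing insert_q and csv_reader from the code above (the batch size is my own assumption; cursor.executemany, which the first comment also suggests, does the multi-row expansion for you):

BATCH_SIZE = 100  # assumed batch size; tune for your hardware

def batch_insert_thread(q: Queue, cursor, mydb):
    sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    while True:
        batch = q.get()
        if batch is None:
            mydb.commit()
            cursor.close()
            break
        cursor.executemany(sql, batch)  # one round trip per batch instead of per row

# in the main loop, accumulate rows and enqueue full batches:
batch = []
for row in csv_reader:
    batch.append((row[0], row[1], row[3], row[4], row[5], row[6], row[8],
                  row[10], row[11], row[12], row[15], row[16], row[17],
                  row[18], row[19], row[20]))
    if len(batch) >= BATCH_SIZE:
        insert_q.put(batch)
        batch = []
if batch:           # flush the final partial batch
    insert_q.put(batch)
insert_q.put(None)  # tell the worker to commit and shut down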
