I need to read data from a huge table (more than 1 million rows, 16 columns of raw text) and do some processing on it. Reading it row by row (Python, MySQLdb) is very slow, and I would like to read multiple rows at a time (and possibly parallelize the reads).
Just FYI, my code currently looks something like this:
cursor.execute('select * from big_table')
rows = int(cursor.rowcount)
for i in range(rows):
    row = cursor.fetchone()
    # ... DO processing ...
I tried to run multiple instances of the program to iterate over different sections of the table (for example, the 1st instance would iterate over 1st 200k rows, 2nd instance would iterate over rows 200k-400k ...) but the problem is that the 2nd instance (and 3rd instance and so on) takes FOREVER to get to a stage where it starts looking at row 200k onwards. It almost seems like it is still doing the processing of 1st 200k rows instead of skipping over them. The code I use (for 2nd instance) in this case is something like:
for i in range(rows):
    # Fetch the row but do nothing (need to skip over 1st 200k rows)
    row = cur.fetchone()
    if i not in range(200000, 400000):
        continue
    # ... DO processing ...
How can I speed up this process? Is there a clean way to do faster/parallel reads from MySQL database through python?
EDIT 1: I tried the "LIMIT" thing based on the suggestions below. For some reason though when I start 2 processes on my quad core server, it seems like only 1 single process is being run at a time (CPU seems to be time sharing between these processes, as opposed to each core running a separate process). The 2 python processes are using respectively 14% and 9% of the CPUs. Any thoughts what might be wrong?
I would suggest the LIMIT clause, but Ignacio already did that.

The expense of a DB read occurs when you call a function like fetchone. Your code doesn't skip rows of the database. It simply skips your processing. The expensive DB stuff (IO and memory thrashing) is still occurring for each row in each process.

Unless you need i for something specific in your loop, you can simply write for row in cursor instead of for i in range(rows): row = cursor.fetchone().
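To make the two points above concrete, here is a minimal sketch of a worker that asks the server for only its own slice with LIMIT/OFFSET and then iterates the cursor directly. Several things here are assumptions, not part of the original code: the helper name iter_chunk, an id column to order by, and the use of sqlite3 as a stand-in driver so the example runs without a MySQL server. With MySQLdb you would pass a MySQLdb connection and keep the default "%s" placeholder.

```python
import sqlite3  # stand-in driver for the demo only; MySQLdb offers the same DB-API calls


def iter_chunk(conn, offset, count, placeholder="%s"):
    """Yield `count` rows of big_table starting at `offset` (hypothetical helper)."""
    # ORDER BY a unique key (an `id` column is assumed) keeps slices stable
    # and non-overlapping between workers.
    sql = "SELECT * FROM big_table ORDER BY id LIMIT {p} OFFSET {p}".format(p=placeholder)
    cursor = conn.cursor()
    cursor.execute(sql, (count, offset))
    try:
        # Iterate the cursor instead of counting rows and calling fetchone().
        for row in cursor:
            yield row
    finally:
        cursor.close()


def demo():
    # Tiny in-memory table standing in for the real big_table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, txt TEXT)")
    conn.executemany(
        "INSERT INTO big_table (txt) VALUES (?)",
        [("row %d" % n,) for n in range(10)],
    )
    # A "second worker" that handles 4 rows after skipping the first 4.
    return [row[0] for row in iter_chunk(conn, offset=4, count=4, placeholder="?")]
```

Each worker would call iter_chunk with its own offset (0, 200000, 400000, ...) and a count of 200000. The server still has to step past the skipped rows, but they are never sent to the client, which is where the skipping loop in the question spends its time. If the table has an indexed key, a range condition on that key (WHERE id BETWEEN ... AND ...) is worth trying in place of a large OFFSET.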