
I'm querying a few date ranges in a loop:

import MySQLdb

con = MySQLdb.connect(host='xxx.com', port=3306, user='user', passwd='pass', db='db')

intervals = 10
precision = 1

def load_data_window(start, precision):
    length = start + precision  # <-- end of the time period, in hours
    limit = 20000  # currently unused
    cur = con.cursor()
    sql = "SELECT data FROM table WHERE date < DATE_SUB(NOW(), INTERVAL %s HOUR) AND date > DATE_SUB(NOW(), INTERVAL %s HOUR)"
    cur.execute(sql, (start, length))
    return cur.fetchall()

dateChunks = list()
for i in range(0, intervals, precision):
    results = load_data_window(i, precision)
    dateChunks.append(results)

It works lightning fast for the first few loops, sometimes even all of them, but bogs down significantly from time to time. There are other actions happening on the database from different places... Is there something I can do to ensure my query has priority? Something like a transaction?

######### EDIT

I did notice that if I move the con = MySQLdb... inside the load_data_window function, I get really fast results and then bogging, whereas if I keep the con = MySQLdb... outside, it is consistently slower...

1 Answer


Does your date column have an index on it? Such an index may help performance a great deal. That's especially true if your queries progressively slow down: MySQL may be scanning through your table, further and further for each successive hour, looking for the first row to return.
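
If it's missing, adding one is a single statement (assuming the table and column really are named table and date, as in your snippet; table is a reserved word in MySQL, hence the backticks):

ALTER TABLE `table` ADD INDEX idx_date (`date`);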

It looks like you're fetching an hour's worth of data with each request, and running fetchall() to slurp all that data into RAM in your python program at once. Is that what you want? If you have an hour with lots of results in it, you may hammer the RAM in your python program, forcing garbage collection and even operating-system thrashing.

Can you, instead, cause your python program to iterate over the rows in the result set? This will most likely take a lot less RAM. You can do something like this: http://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-fetchone.html

cur.execute(sql, (start, length))
for row in cur:  # iterate over the cursor instead of calling fetchall()
    print(row)
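
Note that MySQLdb's default cursor still buffers the entire result set on the client when execute() runs, so to genuinely stream rows from the server you want a server-side cursor. A minimal sketch, assuming the same con and query as above (process() is a hypothetical per-row handler):

import MySQLdb.cursors

cur = con.cursor(MySQLdb.cursors.SSCursor)  # server-side cursor: rows stream as you iterate
cur.execute(sql, (start, length))
for row in cur:
    process(row)  # hypothetical handler for one row
cur.close()  # read or close the result fully before reusing the connection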

This sort of strategy would also allow you to process your data efficiently in chunks bigger than an hour if that would help your application. In many, but not all, cases fewer queries means better queries.
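
If chunked processing helps, the standard DB-API fetchmany() gives you bounded batches; a minimal sketch, reusing cur, sql, start, and length from the question (process() is again a hypothetical handler):

cur.execute(sql, (start, length))
while True:
    rows = cur.fetchmany(1000)  # pull a bounded batch at a time; 1000 is an arbitrary size
    if not rows:
        break
    for row in rows:
        process(row)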

Do some of your hours have hugely more rows in them? This could be causing the unpredictable performance.
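
A quick way to check, again assuming the question's table and column names:

SELECT DATE_FORMAT(date, '%Y-%m-%d %H:00') AS hr, COUNT(*) AS n
FROM `table`
GROUP BY hr
ORDER BY n DESC;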

Finally, you have an off-by-one bug in your query. It should say this...

 WHERE ...  AND date >= DATE_SUB(NOW(), INTERVAL %s HOUR)

(Notice the >= replacing your >)


1 Comment

Ollie, thank you for your input. I'm giving you a point for all the great feedback and suggestions. I'm leaving the question open just in case someone has some more insights. It's a pretty open-ended question, so thank you for your time.
