
I'm looking for a general rule of thumb on when it's faster to re-query the database and when it's faster to use Python to extract data from the cached queryset.

Let's assume I need to extract two things simultaneously from the database: all pizzas, and a specific pizza with pk=5.

Which is more efficient:

pizzas = Pizza.objects.all()
specific_pizza = Pizza.objects.get(pk=5)

OR

pizzas = Pizza.objects.all()
for pizza in pizzas:
    if pizza.pk == 5:
        specific_pizza = pizza
        break

Of course it depends on the database. For example, if pizzas is 10 million rows, re-querying the database is obviously better; and if pizzas is 10 rows, Python is probably faster even if the field is indexed.

Can anyone help with which is more efficient in the middle range? For example, when pizzas is hundreds of rows? Thousands of rows?

3 Answers


There's no definitive answer to this question - as you said, it depends on the database (and probably also its location, the number and size of the tables, ...). You'll have to test in your particular environment.

Besides raw speed, there are some important advantages to using the first version:

  • It's shorter and clearer
  • The ORM knows exactly what you want, so any further optimizations can be done at that level instead of in your application (see the sketch after this list)
  • It avoids doing (potentially) intensive computation in your web server
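
For instance, the ORM turns a pk lookup into a single indexed SELECT, and you can always inspect the SQL a queryset would run. A quick sketch, assuming the Pizza model from the question:

specific_pizza = Pizza.objects.get(pk=5)  # one indexed query; the DB does the lookup

# A QuerySet can show the SQL it would execute:
print(Pizza.objects.filter(pk=5).query)
# e.g. SELECT ... FROM "app_pizza" WHERE "app_pizza"."id" = 5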

Also, some food for thought: if your tables are small enough that Python is faster than the DB, does speed matter at all?

You may want to read up on premature optimization.


6 Comments

(1) As long as I know a rule of thumb for when to use Python vs SQL, it doesn't cost me anything to write the preferred option in advance. (2) I don't have a lot of experience with SQL, but as a rule of thumb, going over 20 queries per page can be quite a lot. (3) Put together, data normalization and using SQL for all my queries without Python can lead me to more than 20 queries.
@ErezO You can't possibly "write in advance" unless you're sure the number of rows stays within a very low limit. If you do know that there will be a fixed low number of rows, maybe SQL is not the best fit for your application. Also note that there won't be 20 queries when doing Pizza.objects.get(pk=5) - only one. Even if you need more complex operations (like joining multiple tables), you should be able to do most things with only one query - that's the whole point of a relational database: to push such operations to the server.
Thx @goncalopp ch3ka. I don't think it's possible to reduce it to one query; that's why I asked the question. In a situation where I need to display pizzas according to several filters on the same page (e.g. all_pizzas, pk=5, topping=mushrooms, etc.) I don't know of a way to do it in one query (if you do, please tell me). Hence the question of when it's preferable to use Python. You and ch3ka both basically say that the rule of thumb should be to always use SQL and never use Python.
@ErezO If you have the time, you should do a thorough read of Django's documentation Making Queries. The section querysets are lazy answers your question, I think. Note that you can print the SQL a queryset represents with print myqueryset.query
@goncalopp believe me, I've read it cover to cover many times. I meant a situation where several filters need to be rendered into an HTML page simultaneously (like "all_pizzas" and "pizzas_with_mushrooms" and "pizza_no_5"). In this case, I see only two solutions - use SQL for each query, or use SQL only once and Python for the rest. That's why I asked the original question of when to use Python. BTW, I didn't accept the answer because I don't think it's correct. Had it been, Django wouldn't go to the trouble of improving prefetch_related.

for example, if pizzas is 10 million rows, re-querying the database is obviously better; and if pizzas is 10 rows, Python is probably faster even if the field is indexed.

Well... first statement: yes. Second statement: not sure, but also not important, because when there are only a few pizzas, neither command will take a noticeable amount of time.

Can anyone help with which is more efficient in the middle range?

Not what you expected, I guess, but yes: since we agree that using .get() will be faster when there are many pizzas, and since we see that performance is only a concern when there are many pizzas, considering that the number of pizzas may grow in the future, I think we can agree that using .get() is the right thing to do.

Performance aside - it's also clearly more readable, so you really should go that route.

Also, note that you can use methods on a QuerySet (.all() returns a QuerySet!) to filter what you want. How this works is "magic behind the scenes" - and as such assumed to be optimized until evidence is found against that assumption. So you should use those methods, until you hit a point where targeted optimization is really needed. And if you ever hit that point, you can benchmark away and have a reliable answer.
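
For example (a sketch; the topping field is taken from the question's follow-up comments, not from your actual model):

# Each of these is one targeted SQL query, filtered on the DB server:
specific_pizza = Pizza.objects.get(pk=5)
mushroom_pizzas = Pizza.objects.filter(topping=Pizza.MUSHROOM)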

1 Comment

thx @ch3ka, I commented on goncalopp's answer, but it was meant for both of you.

I appreciate @ch3ka's and @goncalopp's responses, but I didn't think they directly answered the question, so here's my shot at some profiling myself:

Bottom line: I found the point where a Python lookup is about even with SQL to be around 1000 entries.

Assuming I've already queried the database and received 1000 pizzas:

pizzas = Pizza.objects.all()

I did two tests:

Test 1: Find a specific pizza among 1000 pizzas by looking at pks:

for pizza in pizzas:
    if pizza.pk == 500:
        specific_pizza = pizza
        break

Took 0.2 milliseconds.

Test 2: Filter according to a field of Pizza and create a new list:

mushroom_pizzas = [pizza for pizza in pizzas if pizza.topping == Pizza.MUSHROOM]

where MUSHROOM is an enum value for a possible topping. I chose an enum because I think it's a fair comparison to an indexed DB field.
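
For reference, the model I'm assuming in these tests looks roughly like this (a simplified sketch; the second topping choice is just illustrative):

from django.db import models

class Pizza(models.Model):
    MUSHROOM = 1
    ONION = 2  # illustrative second choice
    TOPPING_CHOICES = [(MUSHROOM, 'mushroom'), (ONION, 'onion')]
    # indexed, to match the indexed DB field in the comparison
    topping = models.IntegerField(choices=TOPPING_CHOICES, db_index=True)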

Took 0.3 milliseconds.

Using the Django Debug Toolbar, the time a simple indexed SQL query takes is around 0.3 milliseconds.
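
For anyone who wants to reproduce the numbers, here's roughly how the Python-side timing can be done (a sketch using timeit; the queryset is evaluated up front so only the in-memory loop is measured):

import timeit

pizzas = list(Pizza.objects.all())  # force the query once, up front

def find_by_pk():
    for pizza in pizzas:
        if pizza.pk == 500:
            return pizza

# average seconds per lookup over 1000 runs
print(timeit.timeit(find_by_pk, number=1000) / 1000)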

I do think, like @goncalopp and @ch3ka, that since simple indexed queries already take about 0.3 milliseconds, there's really no point going to Python for optimization. So even if I know in advance that the number of entries will be less than 1000, or even far less than 1000, I would still always use SQL.

I'd appreciate any comments if I miscalculated or reached a wrong conclusion.

