How to optimize queries with user-defined parameters in the "where" clause?

Question

I'm learning how to do proper query optimization using indexes. Let's say I have a huge table of products with all kinds of details for each product, e.g. price, category, number of purchases, review average, and more. When having multiple "where" conditions, I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.

However, I'm having difficulty figuring out how to scale it if there are so many queries for different purposes, and if users get to pick how to filter the products table. For example, a user can browse products WHERE rating > 4 AND purchases > 100, or it could be WHERE category = 'x' AND price < 100 AND price > 20. How would a proper multi-column index work if the columns chosen to be filtered are random?

Michael - sqlbot · Accepted Answer · 2013-10-28 03:46:04Z

I learned that it's best to put a multi-column index on whatever your "where" conditions are, in the order that they appear.

You learned... not quite correctly.

The order of appearance in the WHERE clause is not meaningful, since the optimizer is free to evaluate the conditions in any logically valid way, subject of course to parentheses and logical operators (AND, OR, etc.) in the expression.

The order of columns in a multi-column index is important because, from left to right, as soon as a column is encountered in an index that not mentioned in the where clause, nothing more toward the right side of that index can be used.

If 3 columns, (a,b,c) are indexed, and the query is WHERE a = 1 AND c = 6 then the optimizer will only be able to use the left-most "a" column values in that index, not "c".

In that case, it would likely still choose to use the index to find rows where a = 1, and then scan all of those identified rows for only those with c = 6.

You can visualize a multi-column index as a multidimensional array. Without a known value or range you need to match for the first column (a), the values for the second column (b) are a meaningless, unordered jumble of data, because they're sorted in "groups of 'a'"... you'd have to iterate through every "a" to find the matching "b" values, and iterate through every "a,b" to find the matching "c" values. Since, in the example above, the "b" value is "anything" since it isn't specified, the ordering of the "c" values is meaningless and inaccessible for optimizing the query (although when every column within the SELECT list is available within a single index, the optimizer may scan the index instead of scanning the whole table, treating it as a "covering index," which is generally better than a full table scan but still suboptimal).

If your WHERE clause includes two columns both of which are indexed individually, the optimizer will check the index statistics and try to use the one that is most likely to produce the fewest matches... if "a" and "c" each have an individual index, and the index stats indicate that there are many values for "c" (high cardinality) but only a few values for "a" (low cardinality) the optimizer will typically use the index on "c" to find matching rows, then scan all of those rows for the requested values of "a".

Or, it may try to use the union of the two indexes, to precisely identify which rows satisfy both conditions.

Neither of these strategies is optimal, either, but still far better than a full table scan, so itdoes suggest that you should -- at a minimum -- have every independently-searchable column as the leftmost column in an index... that is, any column that can be queried on its own, with no other columns in the WHERE clause, and return a reasonably-sized result-set. If the result-set will not be reasonable in size, you may wish to restrict the user to searching on additional attributes, in the application.

In the case of WHERE category = 'x' AND price < 100 AND price > 20 the better index would be (category,price) and not (price,category) but this is not because of the ordering of expressions in the WHERE clause. It is because category is an equality test, but price is a range. WHERE price < 100 AND price > 20 AND category ='x' is equivalent, and (category,price) is still the appropriate index -- because indexes are sorted by the first column, then within each value for the first column, they are sorted by the values of the second column, then within each (first,second) pair they are sorted by the values in the third column, ad infinitum... so with (category,price) the server goes directly to all of the rows for category = 'x' and within that grouping in the index, the referenced rows are already sorted by price, so it only has to select the range of price within the category 'x' of the index. Optimal. The (price,category) index requires checking all the prices in the range, and then checing the category value for all of those. The index could still be used, but depending on the criteria, the optimizer could still opt to scan the whole table.

If you add a third criteria to the WHERE clause that isn't indexed, the same path will be followed, but the server will scan the identified rows for matches with the required value of the non-indexed column. Again, suboptimal, but often acceptable, depending on your business needs -- which play a role in determining the correct answer to this question.

Every index requires space, and resources, because every insert, update, and delete, requires that the server make the necessary changes -- right then -- to every index that is affected by the changes to the table.

Note also that if you have an index on (a,b) or (a,b,c), etc., then a separate index on (a) generally is considered a waste of space, since the index on (a,...anything-else...) will also serve as an index on (a).

Experimenting with EXPLAIN SELECT (which also supports INSERT/UPDATE/DELETE as of MySQL 5.6) and genuinely understanding its output is an indispensable tool for understanding how indexes work. MySQL 5.6 also supports optimizer tracing, which gives you detailed output of how the optimizer understood your query, the various plans it considered, the cost it estimated of each plan, and how it arrived at the decision of how to execute a particular query.

Thank you very much for the explanation. Though I understand some of it, I must admit, some of it is still over my head. I'll have to read the EXPLAIN output page when I have the time, and look at an optimization chapter in a MySQL book. Thanks again.
Thank you. Please consider accepting the answer or let me know if there are points I can clarify.

Collectives™ on Stack Overflow

How to optimize queries with user-defined parameters in the "where" clause?

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related