EDIT: After looking at some of the answers here and hours of research, my team concluded that there was most likely no way to optimize this further than the 4.5 seconds we were able to achieve (except perhaps by partitioning offers_clicks, but that would have some ugly side effects). Eventually, after lots of brainstorming, we decided to split the query in two, build two sets of user ids (one from the users table and one from offers_clicks), and compare them using sets in Python. The set of ids from the users table is still pulled from SQL, but we moved offers_clicks to Lucene and added some caching on top of it, so that is where the other set of ids now comes from. The end result is that it's down to about half a second with the cache and 0.9s without it.
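A minimal sketch of that final approach, assuming both id lists fit in memory. The function names and the hard-coded ids are hypothetical placeholders for the SQL query on users and the Lucene lookup for offers_clicks; this is not the actual production code.

```python
# Sketch of the "two sets of ids" approach. Both fetch functions are
# hypothetical placeholders: in a real setup one would run the SQL query on
# users and the other would query the Lucene index (possibly via a cache).


def fetch_active_user_ids():
    # Stand-in for: SELECT id FROM users
    #               WHERE country = 'US' AND last_active > '2015-02-26'
    return {1, 2, 3, 5, 8}


def fetch_clicking_user_ids():
    # Stand-in for the offers_clicks lookup (date and ranking_score filters).
    # A user with several matching clicks appears once because this is a set.
    return {2, 3, 4, 8, 13}


def count_matching_users(active_ids, clicking_ids):
    # Set intersection plays the role of the join plus COUNT(DISTINCT users.id).
    return len(active_ids & clicking_ids)


count = count_matching_users(fetch_active_user_ids(), fetch_clicking_user_ids())
print(count)
```

Because each side is already a set of unique ids, no separate de-duplication step is needed before counting.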

Start of original post: I am having trouble optimizing a query. The first version of the query is fine, but as soon as offers_clicks is joined in the second version, it becomes rather slow. The users table contains 10 million rows; offers_clicks contains 53 million rows.

Acceptable performance:

SELECT count(distinct(users.id)) AS count_1
FROM users USE index (country_2)
WHERE users.country = 'US'
  AND users.last_active > '2015-02-26';
1 row in set (0.35 sec)

Bad:

SELECT count(distinct(users.id)) AS count_1
FROM offers_clicks USE index (user_id_3), users USE index (country_2)
WHERE users.country = 'US'
  AND users.last_active > '2015-02-26'
  AND offers_clicks.user_id = users.id
  AND offers_clicks.date > '2015-02-14'
  AND offers_clicks.ranking_score < 3.49
  AND offers_clicks.ranking_score > 0.24;
1 row in set (7.39 sec)

Here's how it looks without specifying any indexes (even worse):

SELECT count(distinct(users.id)) AS count_1
FROM offers_clicks, users
WHERE users.country IN ('US')
  AND users.last_active > '2015-02-26'
  AND offers_clicks.user_id = users.id
  AND offers_clicks.date > '2015-02-14'
  AND offers_clicks.ranking_score < 3.49
  AND offers_clicks.ranking_score > 0.24;
1 row in set (17.72 sec)

Explain:

explain SELECT count(distinct(users.id)) AS count_1 FROM offers_clicks USE index (user_id_3), users USE index (country_2) WHERE users.country IN ('US') AND users.last_active > '2015-02-26' AND offers_clicks.user_id = users.id AND offers_clicks.date > '2015-02-14' AND offers_clicks.ranking_score < 3.49 AND offers_clicks.ranking_score > 0.24;
+----+-------------+---------------+-------+---------------+-----------+---------+------------------------------+--------+--------------------------+
| id | select_type | table         | type  | possible_keys | key       | key_len | ref                          | rows   | Extra                    |
+----+-------------+---------------+-------+---------------+-----------+---------+------------------------------+--------+--------------------------+
|  1 | SIMPLE      | users         | range | country_2     | country_2 | 14      | NULL                         | 245014 | Using where; Using index |
|  1 | SIMPLE      | offers_clicks | ref   | user_id_3     | user_id_3 | 4       | dejong_pointstoshop.users.id | 270153 | Using where; Using index |
+----+-------------+---------------+-------+---------------+-----------+---------+------------------------------+--------+--------------------------+

Explain without specifying any indexes:

mysql> explain SELECT count(distinct(users.id)) AS count_1 FROM offers_clicks, users WHERE users.country IN ('US') AND users.last_active > '2015-02-26' AND offers_clicks.user_id = users.id AND offers_clicks.date > '2015-02-14' AND offers_clicks.ranking_score < 3.49 AND offers_clicks.ranking_score > 0.24;
+----+-------------+---------------+-------+------------------------------------------------------------------------+-----------+---------+------------------------------+--------+--------------------------+
| id | select_type | table         | type  | possible_keys                                                          | key       | key_len | ref                          | rows   | Extra                    |
+----+-------------+---------------+-------+------------------------------------------------------------------------+-----------+---------+------------------------------+--------+--------------------------+
|  1 | SIMPLE      | users         | range | PRIMARY,last_active,country,last_active_2,country_2                    | country_2 | 14      | NULL                         | 221606 | Using where; Using index |
|  1 | SIMPLE      | offers_clicks | ref   | user_id,user_id_2,date,date_2,date_3,ranking_score,user_id_3,user_id_4 | user_id_2 | 4       | dejong_pointstoshop.users.id |      3 | Using where              |
+----+-------------+---------------+-------+------------------------------------------------------------------------+-----------+---------+------------------------------+--------+--------------------------+

Here's a whole bunch of indexes I tried with not too much success:

+---------------+------------+-----------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name                    | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| offers_clicks |          1 | user_id_3                   |            1 | user_id         | A         |         198 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_3                   |            2 | ranking_score   | A         |         198 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_3                   |            3 | date            | A         |         198 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_2                   |            1 | user_id         | A         |    17838712 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_2                   |            2 | date            | A         |    53516137 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_4                   |            1 | user_id         | A         |         198 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_4                   |            2 | date            | A         |         198 |     NULL | NULL   |      | BTREE      |         |               |
| offers_clicks |          1 | user_id_4                   |            3 | ranking_score   | A         |         198 |     NULL | NULL   |      | BTREE      |         |               |
| users         |          1 | country_2                   |            1 | country         | A         |          14 |     NULL | NULL   |      | BTREE      |         |               |
| users         |          1 | country_2                   |            2 | last_active     | A         |     8048529 |     NULL | NULL   |      | BTREE      |         |               |

Simplified users schema:

+---------------------------------+---------------+------+-----+---------------------+----------------+
| Field                           | Type          | Null | Key | Default             | Extra          |
+---------------------------------+---------------+------+-----+---------------------+----------------+
| id                              | int(11)       | NO   | PRI | NULL                | auto_increment |
| country                         | char(2)       | NO   | MUL |                     |                |
| last_active                     | datetime      | NO   | MUL | 2000-01-01 00:00:00 |                |

Simplified offers clicks schema:

+-----------------+------------------+------+-----+---------------------+----------------+
| Field           | Type             | Null | Key | Default             | Extra          |
+-----------------+------------------+------+-----+---------------------+----------------+
| id              | int(11)          | NO   | PRI | NULL                | auto_increment |
| user_id         | int(11)          | NO   | MUL | 0                   |                |
| offer_id        | int(11) unsigned | NO   | MUL | NULL                |                |
| date            | datetime         | NO   | MUL | 0000-00-00 00:00:00 |                |
| ranking_score   | decimal(5,2)     | NO   | MUL | 0.00                |                |
  • 1
    Please post your schema! Commented Mar 8, 2015 at 21:33
  • 1
    Note that DISTINCT is not a function Commented Mar 12, 2015 at 7:39
  • Strawberry, +1. The parentheses used after distinct are simply ignored. distinct(user.id) is better written as distinct user.id because "distinct is not a function". Commented Mar 12, 2015 at 7:48
  • what are the record counts for both tables please? Commented Mar 12, 2015 at 8:04
  • 10 million in users, 53 million in offers_clicks Commented Mar 12, 2015 at 21:22

6 Answers

5
+300

This is your query:

SELECT count(distinct u.id) AS count_1
FROM offers_clicks oc JOIN
     users u
     ON oc.user_id = u.id
WHERE u.country IN ('US') AND u.last_active > '2015-02-26' AND
      oc.date > '2015-02-14' AND
      oc.ranking_score > 0.24 AND oc.ranking_score < 3.49;

First, instead of count(distinct), you might consider writing the query as:

SELECT count(*) AS count_1
FROM users u
WHERE u.country IN ('US') AND u.last_active > '2015-02-26' AND
      EXISTS (SELECT 1
              FROM offers_clicks oc
              WHERE oc.user_id = u.id AND
                    oc.date > '2015-02-14' AND
                    oc.ranking_score > 0.24 AND oc.ranking_score < 3.49
             )

Then, the best indexes for this query are: users(country, last_active, id) and either offers_clicks(user_id, date, ranking_score) or offers_clicks(user_id, ranking_score, date).
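To check that the EXISTS form counts the same users as the count(distinct) join, here is a small sketch using Python's sqlite3 module with an in-memory database. SQLite stands in for MySQL purely for illustration, the rows are made-up sample data, and the result says nothing about performance on the real tables.

```python
import sqlite3

# In-memory SQLite as a stand-in for MySQL (an assumption for illustration).
# Table and column names follow the question; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT, last_active TEXT);
    CREATE TABLE offers_clicks (
        id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT, ranking_score REAL);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "US", "2015-03-01"),  # active US user with two matching clicks
    (2, "US", "2015-03-02"),  # active US user, click score out of range
    (3, "US", "2015-01-01"),  # US user, not recently active
    (4, "NL", "2015-03-01"),  # active, but different country
    (5, "US", "2015-03-03"),  # active US user with one matching click
])
conn.executemany(
    "INSERT INTO offers_clicks (user_id, date, ranking_score) VALUES (?, ?, ?)",
    [
        (1, "2015-02-20", 1.00), (1, "2015-02-21", 2.00),
        (2, "2015-02-20", 5.00),
        (3, "2015-02-20", 1.00),
        (4, "2015-02-20", 1.00),
        (5, "2015-02-15", 0.50),
    ],
)

join_sql = """
    SELECT count(DISTINCT u.id)
    FROM offers_clicks oc JOIN users u ON oc.user_id = u.id
    WHERE u.country IN ('US') AND u.last_active > '2015-02-26'
      AND oc.date > '2015-02-14'
      AND oc.ranking_score > 0.24 AND oc.ranking_score < 3.49
"""
exists_sql = """
    SELECT count(*)
    FROM users u
    WHERE u.country IN ('US') AND u.last_active > '2015-02-26'
      AND EXISTS (SELECT 1
                  FROM offers_clicks oc
                  WHERE oc.user_id = u.id
                    AND oc.date > '2015-02-14'
                    AND oc.ranking_score > 0.24 AND oc.ranking_score < 3.49)
"""

join_count = conn.execute(join_sql).fetchone()[0]
exists_count = conn.execute(exists_sql).fetchone()[0]
print(join_count, exists_count)
```

User 1 has two matching clicks but is counted once by both forms: the join relies on DISTINCT for that, while EXISTS only asks whether at least one matching click exists.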

11 Comments

I tried this with users(country, last_active) and offers_clicks(user_id, date, ranking_score). Speed is about the same: 1 row in set (6.45 sec). How important is id in the compound index on the users table? I'd like to learn how it impacts the query. I can try adding an index on (country, last_active, id) tomorrow and see how that changes things.
Can you try the query using = 'US' rather than IN? That might be preventing optimal use of the index. id isn't that important. It just allows the index to be a covering index, so the engine doesn't have to fetch data from the data pages.
Thanks Gordon; I will try adding "id" to the compound index on the users table tomorrow. I tried = 'US' earlier as well; it did not seem to make a big difference, or any difference at all (I didn't fully benchmark it, but the speed seems to be about the same).
Added the index to cover country, last_active and id in users table, unfortunately it has not made much of a difference. SELECT count(*) AS count_1 FROM users u USE INDEX (country_3) WHERE u.country = 'US' AND u.last_active > '2015-02-26' AND EXISTS (SELECT 1 FROM offers_clicks oc USE INDEX (user_id_3) WHERE oc.user_id = u.id AND oc.date > '2015-02-14' AND oc.ranking_score > 0.24 AND oc.ranking_score < 3.49);
Provide the EXPLAIN SELECT... without and with id in the (country, last_active) index. They will probably be identical if the table is InnoDB. This is because the PRIMARY KEY is silently appended to each secondary key.
1
SELECT count(distinct u.id) AS count_1
FROM users u
STRAIGHT_JOIN offers_clicks oc
     ON oc.user_id = u.id
WHERE 
    u.country IN ('US') 
    AND u.last_active > '2015-02-26' 
    AND oc.date > '2015-02-14' 
    AND oc.ranking_score > 0.24 
    AND oc.ranking_score < 3.49;

Make sure you have an index on users (id, last_active, country) and one on offers_clicks (user_id, date, ranking_score).

Or you can reverse the join order:

SELECT count(distinct u.id) AS count_1
FROM offers_clicks oc 
STRAIGHT_JOIN users u
     ON oc.user_id = u.id
WHERE 
    u.country IN ('US') 
    AND u.last_active > '2015-02-26' 
    AND oc.date > '2015-02-14' 
    AND oc.ranking_score > 0.24 
    AND oc.ranking_score < 3.49;

Make sure you have an index on offers_clicks (user_id) and one on users (id, last_active, country).

0
SELECT count(users.id) AS count_1 
FROM users 
INNER JOIN
  (SELECT
    DISTINCT user_id
  FROM
    offers_clicks
  WHERE offers_clicks.date > '2015-02-14' 
    AND offers_clicks.ranking_score < 3.49 
    AND offers_clicks.ranking_score > 0.24
  ) as clicks
ON clicks.user_id  = users.id
WHERE users.country IN ('US') 
    AND users.last_active > '2015-02-26' 

Could you provide an sqlfiddle with some data, please?

And could you tell me the execution time for this query:

SELECT
    DISTINCT user_id
  FROM
    offers_clicks
  WHERE offers_clicks.date > '2015-02-14' 
    AND offers_clicks.ranking_score < 3.49 
    AND offers_clicks.ranking_score > 0.24

EDIT: How long does this one take?

SELECT
    DISTINCT user_id
  FROM
    offers_clicks USE INDEX (user_id_4)
  WHERE offers_clicks.date > '2015-02-14' 
    AND offers_clicks.ranking_score < 3.49 
    AND offers_clicks.ranking_score > 0.24

8 Comments

I will try to set up an sqlfiddle tomorrow. The execution time of the offers_clicks query alone is about 4-5 seconds, almost as slow as your query including users (which runs in about 5-6 seconds, about 1-2 seconds faster than the original query).
Here's the explain on the offers_clicks query btw: | 1 | SIMPLE | offers_clicks | range | date,date_2,date_3,ranking_score | date_2 | 8 | NULL | 2738102 | Using where; Using temporary |
But does it return the correct result? Is it better (5-6 s) than what you had before (17-18 s)? So now I just have to improve it to get under 1 s?
@MathijsdeJong could you provide an sqlfiddle of offers_clicks with 1,000-10,000 records, please? Or just a .sql file with the exported table?
I appreciate your efforts, but 5-6s (yours, as well as some of the other posts) and 6-7s (original) are both very heavy. I was initially looking for sub-1s, but my guess is that offers_clicks is just too large a table to run any meaningful queries on... I'm afraid I asked the impossible.
0

Try doing this the other way around:

SELECT COUNT(users.id)
    FROM users, offers_clicks
    WHERE users.country = 'US'
        AND users.last_active > '2015-02-26'
        AND offers_clicks.user_id = users.id
        AND offers_clicks.date > '2015-02-14'
        AND offers_clicks.ranking_score < 3.49
        AND offers_clicks.ranking_score > 0.24;
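Note that this variant uses COUNT(users.id) without DISTINCT, so a user with several matching clicks is counted once per click rather than once. A small sketch with Python's sqlite3 module and made-up rows (SQLite standing in for MySQL, as an assumption) illustrates the difference:

```python
import sqlite3

# In-memory SQLite as an illustrative stand-in for MySQL; rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT, last_active TEXT);
    CREATE TABLE offers_clicks (
        id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT, ranking_score REAL);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "US", "2015-03-01"),
    (2, "US", "2015-03-02"),
])
conn.executemany(
    "INSERT INTO offers_clicks (user_id, date, ranking_score) VALUES (?, ?, ?)",
    [
        # User 1 has three matching clicks, user 2 has one.
        (1, "2015-02-20", 1.0), (1, "2015-02-21", 2.0), (1, "2015-02-22", 3.0),
        (2, "2015-02-20", 1.5),
    ],
)

# Same FROM/WHERE as the query above; only the counted expression changes.
query = """
    SELECT {expr}
    FROM users, offers_clicks
    WHERE users.country = 'US'
      AND users.last_active > '2015-02-26'
      AND offers_clicks.user_id = users.id
      AND offers_clicks.date > '2015-02-14'
      AND offers_clicks.ranking_score < 3.49
      AND offers_clicks.ranking_score > 0.24
"""

plain = conn.execute(query.format(expr="COUNT(users.id)")).fetchone()[0]
distinct = conn.execute(query.format(expr="COUNT(DISTINCT users.id)")).fetchone()[0]
print(plain, distinct)
```

With this sample data the plain count reflects the four joined rows, while the DISTINCT count reflects the two users, which is what the original query asks for.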

0

First of all, I also think that you should use a join, and try to join only the rows that you really need in the result.
As for the offers_clicks table, I think you should not use index user_id_3 but user_id_2 instead, because the cardinality of user_id_2 is higher than the cardinality of user_id_3 (according to your index listing), so it should be faster.

SELECT
    count(distinct(users.id)) AS count_1
FROM users USE INDEX (country_2)
JOIN offers_clicks USE INDEX (user_id_2)
    ON  offers_clicks.user_id = users.id
    AND offers_clicks.date > '2015-02-14'
    AND offers_clicks.ranking_score < 3.49
    AND offers_clicks.ranking_score > 0.24
WHERE users.country = 'US' AND users.last_active > '2015-02-26'
;

For this query you don't need to alter any tables, which is why I think you can try it.
It may also help to narrow the date range, which reduces the number of rows involved and should make the query faster.

I'm not sure this will be helpful...

0

Try this:

SELECT count(distinct users.id) AS count_1
FROM users USE index (<see below>)
JOIN offers_clicks USE index (<see below>)
    ON offers_clicks.user_id = users.id
    AND offers_clicks.date BETWEEN '2015-02-14' AND CURRENT_DATE
    AND offers_clicks.ranking_score BETWEEN 0.24 AND 3.49
WHERE users.country = 'US'
AND users.last_active BETWEEN '2015-02-26' AND CURRENT_DATE

Make sure there are indexes on users(country, last_active, id) and offers_clicks(user_id, ranking_score, date) and USE them.

Let me know how it performs and if it works I'll explain why.
