I am trying to count entries per user, grouped by year, month, and user, from a table that has 45M entries. The query result has around 4M records, which I wasn't able to retrieve in one go, so I decided to page through it with LIMIT and OFFSET.
To retrieve the first 1M records I've written the query below:
select SQL_BIG_RESULT uis.nick, uis.user_id, CONCAT(t.year, '-', LPAD(t.month, 2, 0)) AS DATE, t.count
from (select SQL_BIG_RESULT e.user_id, YEAR(e.created_at) as year, MONTH(e.created_at) as month, COUNT(*) AS count
from entries e
group by YEAR(e.created_at), MONTH(e.created_at), e.user_id
limit 1000000
) t
inner join users u on u.id = t.user_id
inner join user_infos ui on ui.user_id = u.id
inner join user_identifiers uis on uis.user_info_id = ui.id
order by t.year, t.month, uis.nick;
To retrieve the second 1M records, I set the offset to 999998 so that the two result sets would overlap by 2 rows, which would let me double-check that the paging is correct. Hence the query below:
select SQL_BIG_RESULT uis.nick, uis.user_id, CONCAT(t.year, '-', LPAD(t.month, 2, 0)) AS DATE, t.count
from (select SQL_BIG_RESULT e.user_id, YEAR(e.created_at) as year, MONTH(e.created_at) as month, COUNT(*) AS count
from entries e
group by YEAR(e.created_at), MONTH(e.created_at), e.user_id
limit 999998, 1000000
) t
inner join users u on u.id = t.user_id
inner join user_infos ui on ui.user_id = u.id
inner join user_identifiers uis on uis.user_info_id = ui.id
order by t.year, t.month, uis.nick;
To compare the results, I looked at the tail of the first 1M records and the head of the second 1M records. As I understand it, there should be 2 overlapping records, since I used an offset of 999998, but the tail and head below do not match.
Something is clearly wrong with the query: the first file ends with zzzzz, but the second file starts with 0 3 kalem ucu, which should not come after z in alphabetical order.
$ tail entry_counts_by_users_1_1m.csv
| user_nick | user_id | date | entry_count |
|-------------|---------|---------|-------------|
| zskal | 493395 | 2013-05 | 8 |
| zuhanzee | 397659 | 2013-05 | 2 |
| zulmet | 446672 | 2013-05 | 74 |
| zuluuuuuu | 1240043 | 2013-05 | 9 |
| zverkov | 502616 | 2013-05 | 2 |
| zvezdite | 750458 | 2013-05 | 1 |
| zx | 249598 | 2013-05 | 15 |
| zyprexa 5mg | 779519 | 2013-05 | 16 |
| zzgx | 584985 | 2013-05 | 2 |
| zzzzz | 22730 | 2013-05 | 1 |
$ head entry_counts_by_users_1m_2m.csv
| nick | user_id | DATE | count |
|---------------|---------|---------|-------|
| 0 3 kalem ucu | 624699 | 2013-05 | 4 |
| 0132 | 995914 | 2013-05 | 3 |
| 03072010 | 960606 | 2013-05 | 9 |
| 0312020008 | 804486 | 2013-05 | 2 |
| 0326 | 446816 | 2013-05 | 1 |
| 05 | 575534 | 2013-05 | 1 |
| 05012009 | 1171153 | 2013-05 | 6 |
| 0904 | 514964 | 2013-05 | 2 |
| 0kmzeka | 777191 | 2013-05 | 4 |
Could you help me understand what I am doing wrong here?
+-----------+
| @@version |
+-----------+
| 8.0.19 |
+-----------+
UPDATE
These are the results I get after using ORDER BY in my subquery:
select SQL_BIG_RESULT uis.nick, uis.user_id, CONCAT(t.year, '-', LPAD(t.month, 2, 0)) AS DATE, t.count
from (select SQL_BIG_RESULT e.user_id, YEAR(e.created_at) as year, MONTH(e.created_at) as month, COUNT(*) AS count
from entries e
group by YEAR(e.created_at), MONTH(e.created_at), e.user_id
order by year, month, user_id
limit 1000000) t
inner join users u on u.id = t.user_id
inner join user_infos ui on ui.user_id = u.id
inner join user_identifiers uis on uis.user_info_id = ui.id
For the first 1M records:
$ tail entry_counts_by_users_1_1m.csv
| user_name | user_id | date | entry_count |
|----------------------------|---------|---------|-------------|
| statistic er | 667546 | 2012-06 | 1 |
| mula | 612905 | 2013-02 | 1 |
| sisman cirkin bi de kezban | 1327434 | 2013-02 | 2 |
| tyra34 | 1329280 | 2013-03 | 1 |
| ecemazkan | 1332628 | 2013-02 | 1 |
| susamlicubuk | 1333079 | 2013-02 | 1 |
| hemenhemenherterim | 631784 | 2011-04 | 1 |
| umursamaz tavrin hastasi | 1060158 | 2012-09 | 2 |
| uslucocuk | 1254758 | 2012-09 | 1 |
| dharamsala | 956110 | 2012-09 | 1 |
select SQL_BIG_RESULT uis.nick, uis.user_id, CONCAT(t.year, '-', LPAD(t.month, 2, 0)) AS DATE, t.count
from (select SQL_BIG_RESULT e.user_id, YEAR(e.created_at) as year, MONTH(e.created_at) as month, COUNT(*) AS count
from entries e
group by YEAR(e.created_at), MONTH(e.created_at), e.user_id
order by year, month, user_id
limit 999998, 1000000) t
inner join users u on u.id = t.user_id
inner join user_infos ui on ui.user_id = u.id
inner join user_identifiers uis on uis.user_info_id = ui.id
For the second 1M records:
$ head entry_counts_by_users_1m_2m.csv
| user_name | user_id | date | entry_count |
|-----------|---------|---------|-------------|
| ssg | 8097 | 2013-06 | 101 |
| ssg | 8097 | 2013-07 | 73 |
| ssg | 8097 | 2013-08 | 100 |
| ssg | 8097 | 2013-09 | 88 |
| ssg | 8097 | 2013-10 | 84 |
| ssg | 8097 | 2013-11 | 54 |
| ssg | 8097 | 2013-12 | 64 |
| ssg | 8097 | 2014-01 | 78 |
| ssg | 8097 | 2014-02 | 31 |
I still don't get what I am doing wrong.
If you don't specify ORDER BY, then the engine will return the rows in any order (though @Shadow disagrees with this statement). If the rows are returned in any order, then LIMIT won't work as you expect. I would suggest you add the ORDER BY and this problem may disappear; ...and it's very cheap to do.
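To make the suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The toy data, the simplified two-table schema (a single hypothetical `users` table with a `nick` column instead of the three joined tables), and the helper name `fetch_page` are all assumptions for illustration. It pages over the grouped subquery with an ORDER BY on a key that is unique per group, and repeats that same key in the outer query, so the tail of one page lines up with the head of the next. It also shows that re-sorting each page by nick, as the outer `order by t.year, t.month, uis.nick` does, moves the overlapping rows away from the page boundaries even when the paging itself is correct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (user_id INTEGER, created_at TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, nick TEXT);
""")

# Hypothetical toy data: 6 users, each with `id` entries in two months.
users = [(1, "zed"), (2, "amy"), (3, "moe"), (4, "bob"), (5, "yan"), (6, "cat")]
conn.executemany("INSERT INTO users VALUES (?, ?)", users)
entries = [
    (uid, f"2013-{month:02d}-15")
    for month in (4, 5)
    for uid, _nick in users
    for _ in range(uid)
]
conn.executemany("INSERT INTO entries VALUES (?, ?)", entries)

PAGE_SQL = """
    SELECT u.nick, t.user_id, t.ym, t.cnt
    FROM (SELECT e.user_id AS user_id,
                 strftime('%Y-%m', e.created_at) AS ym,
                 COUNT(*) AS cnt
          FROM entries e
          GROUP BY strftime('%Y-%m', e.created_at), e.user_id
          ORDER BY ym, user_id        -- unique per group, so LIMIT is deterministic
          LIMIT ? OFFSET ?) t
    JOIN users u ON u.id = t.user_id
    ORDER BY t.ym, t.user_id          -- same key again: the join does not preserve order
"""

def fetch_page(offset, size):
    """Return one page of (nick, user_id, year-month, count) rows."""
    return conn.execute(PAGE_SQL, (size, offset)).fetchall()

page_size, overlap = 5, 2
page1 = fetch_page(0, page_size)
page2 = fetch_page(page_size - overlap, page_size)

# With the same sort key inside and outside, the overlap sits at the boundary.
print(page1[-overlap:] == page2[:overlap])

# Re-sorting each page by (year-month, nick) keeps the same rows in each page,
# but the overlapping rows are no longer at the tail and head.
by_nick1 = sorted(page1, key=lambda r: (r[2], r[0]))
by_nick2 = sorted(page2, key=lambda r: (r[2], r[0]))
print(by_nick1[-overlap:] == by_nick2[:overlap])
print(len(set(page1) & set(page2)))
```

In the MySQL queries from the question, the equivalent change would be to keep `order by year, month, user_id` inside the subquery and to add an ORDER BY on the same columns to the outer query. If the output has to be sorted by nick, comparing the two files as sets of rows is a safer check than comparing tail and head.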