4

I have a database with several tables, but only four are involved in the query I want to optimize:

albums, songs, genres, genre_song

A song can have many genres, and a genre many songs. An album can have many songs. An album is related to genres through songs.

The goal is to recommend albums that share a genre with the current album.

That led me to this query:

SELECT *
FROM `albums`
WHERE EXISTS
    (SELECT *
     FROM `songs`
     WHERE `albums`.`id` = `songs`.`album_id`
       AND EXISTS
         (SELECT *
          FROM `genres`
          INNER JOIN `genre_song` ON `genres`.`id` = `genre_song`.`genre_id`
          WHERE `songs`.`id` = `genre_song`.`song_id`
            AND `genres`.`id` IN (6)))
  AND `id` <> 37635
  AND `published` = 1
ORDER BY `release_date` DESC
LIMIT 6

This query takes me between 1.4s and 1.6s. I would like to reduce it as much as possible. The ideal goal would be less than 10ms 😁

I am already using indexes on several tables, and I have managed to reduce other queries from up to 4 seconds to only 15-20ms. I am willing to use anything that reduces the execution time to a minimum.

I am using Laravel, so this would be the query with Eloquent.

$relatedAlbums = Album::whereHas('songs.genres', function ($query) use ($album) {
        $query->whereIn('genres.id', $album->genres->pluck('id'));
    })->where('id', '<>', $album->id)
    ->orderByDesc('release_date')
    ->take(6)
    ->get();

Note: the album's genres have already been loaded at this point.

If you want to recreate the tables and some fake data in your database, here is the structure

  • Just want to point out a few things: 1. The schema provided is not complete, since there is no release_date field on any table. 2. You are executing a query with $album->genres->pluck('id'). 3. You should try running EXPLAIN on each individual query to make sure they are using an index. Commented Aug 29, 2020 at 4:58
  • 1. You are right, I wanted to keep it simple for the question; the truth is there are many fields in each table. 2. In the question I made it clear that the genres had been loaded before, and I need it like that. Therefore, $album->genres does not make another query. 3. I have been doing that from the beginning. The indexes just don't seem to work with EXISTS. That's why I'm here, to seek help. Commented Aug 29, 2020 at 5:50
  • Run EXPLAIN on each individual query; simply sorting by a field without an index will make your query slow. Commented Aug 29, 2020 at 6:11
  • Why not provide some sample data and the desired result? Commented Aug 29, 2020 at 6:59
  • I don't think EXISTS is the bottleneck here. MySQL's EXISTS is pretty performant. I would follow @Pablo's advice, and maybe share the result for us to have a look? And how large a dataset are we talking about? Also, you mentioned that there are many fields. Depending on the type of fields, you might get a little edge by selecting only the required fields in the subqueries. Commented Aug 29, 2020 at 7:06

4 Answers

3

It is hard to guess without seeing the real data, but anyway:

I think the problem is that even if you LIMIT the required rows to 6, you have to read ALL the albums table, because:

  • You are filtering them by a non-indexed column
  • You are sorting them by a non-indexed column
  • You don't know which albums will make the cut (will have a song for required genre). So you calculate all of them, then order by release_date, and keep top 6

If you read the albums through an index ordered by published status and release date, MySQL can stop processing the query as soon as it finds the first 6 albums that make the cut. Of course, you may have 'bad luck': perhaps the albums that have genre-6 songs are the oldest ones, and then you will still have to read and process many albums. Even so, this optimization should not hurt, so it is worth trying, and one would expect the data to be somewhat evenly distributed.
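The early-stop idea can be illustrated outside the database. The following is only a sketch in Python with made-up data (10,000 hypothetical albums, roughly 10% with a genre-6 song, and the `id <> 37635` filter left out). It is not how MySQL is implemented; it just contrasts "filter everything, sort, cut" with "walk rows that are already sorted and stop at 6".

```python
import random


def top_matching(albums, matching_ids, limit=6):
    """Walk albums that are already ordered newest-first; stop after `limit` matches."""
    picked, examined = [], 0
    for album in albums:
        examined += 1
        if album["published"] and album["id"] in matching_ids:
            picked.append(album["id"])
            if len(picked) == limit:
                break
    return picked, examined


random.seed(0)
# Hypothetical data: album i was released on "day" i; every fifth album is unpublished.
albums = [{"id": i, "release_date": i, "published": i % 5 != 0} for i in range(1, 10001)]
genre6_album_ids = {a["id"] for a in albums if random.random() < 0.10}

# No usable index: every album is checked, the matches are sorted, then cut to 6.
full = sorted(
    (a for a in albums if a["published"] and a["id"] in genre6_album_ids),
    key=lambda a: a["release_date"],
    reverse=True,
)
full_scan = [a["id"] for a in full[:6]]

# Index on release_date: rows arrive pre-sorted, so the scan can stop early.
ordered = sorted(albums, key=lambda a: a["release_date"], reverse=True)
early, examined = top_matching(ordered, genre6_album_ids)

print("same result:", early == full_scan)
print("albums examined with early stop:", examined, "of", len(albums))
```

Both approaches return the same six albums; the difference is how many rows have to be examined, which depends on how evenly the matching albums are spread over time.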

Also, as stated in other answers, you don't actually need to access the genres table (although this is probably not the worst problem of the query). You can access just genre_song, and you can create a new index on the two columns you need.

create index genre_song_id_id on genre_song(genre_id, song_id);

Note that this index only makes sense if you change the query (as suggested at the end of this answer).

For the albums table, you can create either of these two indexes:

create index release_date_desc_v1 on albums (published, release_date desc);

create index release_date_desc_v2 on albums (release_date desc, published);

Choose whichever index works better for your data:

  • If the percentage of published albums is "low" you probably want to use _v1
  • Else, _v2 index will be better

Please, test them both, but don't let both indexes coexist at the same time. If testing _v1, make sure you dropped _v2 and vice versa.

Also, change your query so that it does not use the genres table:

SELECT *
FROM `albums`
WHERE EXISTS
    (SELECT *
     FROM `songs`
     WHERE `albums`.`id` = `songs`.`album_id`
       AND EXISTS
         (SELECT *
          FROM `genre_song`
          WHERE `songs`.`id` = `genre_song`.`song_id`
            AND `genre_song`.`genre_id` IN (6)))
  AND `id` <> 37635
  AND `published` = 1
ORDER BY `release_date` DESC
LIMIT 6;

8 Comments

Great answer. Your last suggestion, the join, has an issue with duplicate rows, so you need to use DISTINCT a.*. I wonder whether that is also a performance drawback? (I actually ran this query.)
This query works quite fast; it has been reduced to 28ms. However, as Tharaka Dilshan mentioned, there is a problem of duplicate albums. I tried the query without creating any indexes; do you think it would improve further if I add these indexes? I have also edited my answer by adding the structure plus the test data in case you want to take a look and do some tests.
I have already solved the problem. Even with the original query, just by adding the indexes you indicated, the query has been reduced to only 1.75ms. Thank you very much, and great answer.
The important thing is to have an index on the column you intend to sort by. That way, MySQL can 'stop' the query once it gets 6 rows that make the cut. Adding published at the end makes some improvement, since you get a covering index. When you say you sort by release_date and created_at, is that really an AND? (That is, do you ORDER BY release_date, created_at, or is it an OR: you sometimes ORDER BY release_date and sometimes ORDER BY created_at?) @MrEduar
About descending indexes: MySQL < 8 does not really have DESC indexes. When creating the index, the DESC clause is ignored. It can still use an ASC index for a descending sort (with some performance penalty). MySQL >= 8 does have and can use DESC indexes. So if you really need to sort descending, you had better define the index as DESC for the required columns.
1

One thing I noticed is that you don't have to join the genres table in the following subquery:

AND EXISTS
     (SELECT *
      FROM `genres`
      INNER JOIN `genre_song` ON `genres`.`id` = `genre_song`.`genre_id`
          WHERE `songs`.`id` = `genre_song`.`song_id`
              AND `genres`.`id` IN (6))

We can simplify this, and the whole query could then be the following:

SELECT *
FROM `albums`
WHERE EXISTS
    (SELECT *
     FROM `songs`
     WHERE `albums`.`id` = `songs`.`album_id`
       AND EXISTS
         (SELECT *
          FROM `genre_song`
          WHERE `songs`.`id` = `genre_song`.`song_id`
            AND `genre_song`.`genre_id` IN (6)))
  AND `id` <> 37635
  AND `published` = 1
ORDER BY `release_date` DESC
LIMIT 6
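As a quick sanity check that dropping the join does not change the result, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The tiny data set is invented, album 5 plays the role of the album being viewed (37635 in the question), and the equivalence assumes every genre_song.genre_id refers to an existing genres row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE albums (id INTEGER PRIMARY KEY, published INTEGER, release_date TEXT);
CREATE TABLE songs (id INTEGER PRIMARY KEY, album_id INTEGER);
CREATE TABLE genres (id INTEGER PRIMARY KEY);
CREATE TABLE genre_song (genre_id INTEGER, song_id INTEGER);
INSERT INTO albums VALUES
  (1, 1, '2020-01-01'), (2, 1, '2020-02-01'), (3, 0, '2020-03-01'),
  (4, 1, '2020-04-01'), (5, 1, '2020-05-01');
INSERT INTO songs VALUES (10, 1), (11, 1), (20, 2), (30, 3), (40, 4), (50, 5);
INSERT INTO genres VALUES (6), (7);
INSERT INTO genre_song VALUES (6, 10), (6, 11), (7, 20), (6, 30), (6, 40), (7, 40), (6, 50);
""")

OUTER = """
SELECT albums.id FROM albums
WHERE EXISTS (
    SELECT 1 FROM songs
    WHERE albums.id = songs.album_id
      AND EXISTS ({inner}))
  AND albums.id <> ?
  AND albums.published = 1
ORDER BY albums.release_date DESC
LIMIT 6
"""

# Inner subquery as in the question: joins genres just to filter on its id.
INNER_WITH_GENRES = """
SELECT 1 FROM genres
INNER JOIN genre_song ON genres.id = genre_song.genre_id
WHERE songs.id = genre_song.song_id AND genres.id IN (6)
"""

# Simplified inner subquery: filters on genre_song.genre_id directly.
INNER_WITHOUT_GENRES = """
SELECT 1 FROM genre_song
WHERE songs.id = genre_song.song_id AND genre_song.genre_id IN (6)
"""


def related_ids(inner_sql, current_album_id=5):
    """Run the outer query with the given inner subquery and return album ids."""
    rows = conn.execute(OUTER.format(inner=inner_sql), (current_album_id,))
    return [row[0] for row in rows]


with_genres = related_ids(INNER_WITH_GENRES)
without_genres = related_ids(INNER_WITHOUT_GENRES)
print(with_genres, without_genres)
```

On this data both variants return albums 4 and 1, newest first; album 3 is unpublished, album 2 has no genre-6 song, and album 5 is excluded.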


1

Of course you should optimize your query for a quick response time, but here is another tip that can speed up your responses considerably.

I faced a similar problem with slow response times, and I managed to reduce them substantially simply by using a cache.

You can use the Redis cache driver in Laravel, which saves you from querying the database again and again. Redis stores the result as a key-value pair, so the next API call can return the result from the cache without touching the database. Using the Redis driver for the cache also gives you one advantage that I love.

You can use cache tags

Cache tags allow you to tag related items in the cache and then flush all cached values that have been assigned a given tag. For example, if you have an API that retrieves the posts of the user with $id=1, you can store that data under a cache tag so that the next request for the same record is faster, and when you update the data in the database you can update the tagged cache entry as well. You can do something like the following:

public $cacheTag = 'user';

// Inside a method of the same class: return the record from the tagged cache
// if it is already there; otherwise load it from the database and store it
// in the cache so the next request is faster.
$key = $this->cacheTag . $id;
$item = Cache::tags([$this->cacheTag])->get($key);
if ($item === null) {
    $row = $this->model->find($id);
    if ($row !== null) {
        $item = (object) $row->toArray();
        Cache::tags([$this->cacheTag])->forever($key, $item);
    }
}

When you update the data in the database, you can remove the stale entry from the cache so that it is repopulated on the next read:

if ($refresh) {
    Cache::tags([$this->cacheTag])->forget($this->cacheTag . $id);
}

You can read more about caching in Laravel's documentation.


1

FWIW, I find the following easier to understand, so I would want to see the EXPLAIN for this:

SELECT DISTINCT a.*
  FROM albums a
  JOIN songs s
    ON s.album_id =  a.id 
  JOIN genre_song gs
    ON gs.song_id = s.id 
  JOIN genres g
    ON g.id = gs.genre_id
 WHERE g.id IN (6)
   AND a.id <> 37635
   AND a.published = 1
 ORDER 
    BY a.release_date DESC
 LIMIT 6

In this instance (and assuming the tables are InnoDB), an index on (published, release_date) might help.
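The comments on this answer mention duplicate rows, so here is a small sketch of where they come from. It uses Python's sqlite3 module as a stand-in for MySQL, with invented data and without the `a.id <> 37635` filter: album 1 has two genre-6 songs, so a plain JOIN returns it twice, while DISTINCT or the EXISTS form returns it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE albums (id INTEGER PRIMARY KEY, published INTEGER, release_date TEXT);
CREATE TABLE songs (id INTEGER PRIMARY KEY, album_id INTEGER);
CREATE TABLE genres (id INTEGER PRIMARY KEY);
CREATE TABLE genre_song (genre_id INTEGER, song_id INTEGER);
INSERT INTO albums VALUES (1, 1, '2020-01-01'), (2, 1, '2020-02-01');
INSERT INTO songs VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO genres VALUES (6);
INSERT INTO genre_song VALUES (6, 10), (6, 11), (6, 20);
""")

JOIN_SQL = """
SELECT {select} FROM albums a
JOIN songs s ON s.album_id = a.id
JOIN genre_song gs ON gs.song_id = s.id
JOIN genres g ON g.id = gs.genre_id
WHERE g.id IN (6) AND a.published = 1
ORDER BY a.release_date DESC
LIMIT 6
"""

EXISTS_SQL = """
SELECT a.* FROM albums a
WHERE a.published = 1
  AND EXISTS (
    SELECT 1 FROM songs s
    JOIN genre_song gs ON gs.song_id = s.id
    WHERE s.album_id = a.id AND gs.genre_id IN (6))
ORDER BY a.release_date DESC
LIMIT 6
"""


def album_ids(sql):
    """Run the query and return the album id (first column) of every row."""
    return [row[0] for row in conn.execute(sql)]


plain_join = album_ids(JOIN_SQL.format(select="a.*"))
distinct_join = album_ids(JOIN_SQL.format(select="DISTINCT a.*"))
exists_form = album_ids(EXISTS_SQL)
print(plain_join, distinct_join, exists_form)
```

Note that with a plain JOIN the duplicates also count towards LIMIT 6, so fewer than six distinct albums may come back.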

4 Comments

Here is an EXPLAIN of this query. Like the other answers, this one also duplicates the results.
If you want DISTINCT results, just use DISTINCT.
I used it, but the DISTINCT makes the query slower and I get the same time as in the original problem.
An index on (published, release_date) might help. Edited.
