Needless to say, I am not proficient at SQL. I have to run a query on a table that looks like this:

 id, tp_id, value_1, value_2, value_3, date

This table has 2 entries for each distinct tp_id, with different values. tp_id is an indexed foreign key that references the id column of the following table:

 id, external_id

I'm trying to retrieve data as follows:

Get distinct tp_id where value_2 = 2, value_1 is 1 or 2, value_3 = 1, and date is at least 1 year before now. These conditions must hold true for BOTH entries with a matching tp_id.

I have tried the following query, but as I understand it, the SUM function paired with the JOIN makes the query too slow:

SELECT t1.tp_id, t2.external_id
FROM table_1 t1
JOIN table_2 t2 ON t1.tp_id = t2.id
GROUP BY t1.tp_id
HAVING 
  SUM(
    t1.value_2 = 2 
    AND t1.value_1 IN (1, 2) 
    AND t1.value_3 = 1 
    AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
  ) = 2;

Both tables have roughly 2.5M rows.

I'd like to optimize this query or learn a better way to do this, so any help would be welcome. Thanks in advance.

EDIT: It appears that running this query will be altogether unnecessary. I will therefore close the question. Thanks for the answers.

  • Why are you joining in this query? Are there multiple rows in t2 per t1 or could you just use t1? Commented Jan 26, 2023 at 15:47
  • Also, it's a little unclear to me, is there an index on table_1.tp_id and on table_2.external_id? Commented Jan 26, 2023 at 15:49
  • Yes, there is an index on both columns. I made a mistake in the query I posted: I am joining on t1.tp_id = t2.id, NOT t2.external_id. Editing the query to reflect this. Commented Jan 26, 2023 at 15:52
  • There's a requirement that is not written down but that can perhaps be inferred from your query: you need to return the ID only if both rows match the conditions you listed, is that right? Commented Jan 26, 2023 at 16:00
  • Given the update to the query, can you confirm that there is also an index on table_2.id? Commented Jan 26, 2023 at 16:01

1 Answer

If I understood your requirement correctly, something like this might help.

SELECT tp_id
FROM (
    SELECT t1.tp_id,count(*) as count
    FROM table_1 t1
    WHERE
      t1.value_2 = 2 
      AND (t1.value_1 = 1 OR t1.value_1 = 2) 
      AND t1.value_3 = 1 
      AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
    GROUP BY tp_id
) as res 
WHERE res.count = 2

Essentially, I made 3 performance changes:

  1. The WHERE condition is applied before the GROUP BY, so non-matching rows are discarded before aggregation. This is way more performant than evaluating the whole condition in HAVING.
  2. I've used a nested query, but you can also use HAVING COUNT(tp_id) = 2 depending on your MySQL version.
  3. Two boolean checks should be more performant than an IN clause.
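
For reference, here is a rough sketch of the HAVING variant mentioned in point 2, extended to also return the external_id that the question's query selects. It assumes, as the question states, that each tp_id has exactly 2 rows in table_1 and that table_2.id is unique. Treat it as an untested sketch rather than a drop-in solution.

```sql
-- Sketch only: filter and aggregate table_1 first, then join the
-- (hopefully much smaller) result to table_2 to fetch external_id.
SELECT res.tp_id, t2.external_id
FROM (
    SELECT t1.tp_id
    FROM table_1 t1
    WHERE t1.value_2 = 2
      AND (t1.value_1 = 1 OR t1.value_1 = 2)
      AND t1.value_3 = 1
      AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
    GROUP BY t1.tp_id
    HAVING COUNT(*) = 2   -- both rows for this tp_id passed the filter
) AS res
JOIN table_2 t2 ON t2.id = res.tp_id;
```

The join now only touches the tp_ids that survive the filter, instead of joining all rows before grouping.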

6 Comments

Thanks for the swift answer. I'm in transit right now but will try this out as soon as I return.
Regarding the second query, wouldn't this return tp_ids for which one of the rows matched the conditions, but not the other?
Pretty great, the explanation is awesome. I suspected as much about the IN clause. I will post an update when I'm back at my PC. Thanks.
I've just tried the query; unfortunately it also times out at ~30s. I'm curious as to what causes this. Do you think it might be an issue with the underlying configuration/resources?
Can you share your DB model so that we can check the primary keys? 2.5M rows is really a lot. Have you considered precomputing the results of your query in another table, or in some other way?
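
One way to investigate the timeout, sketched under assumptions: run EXPLAIN on the filtered query to see whether MySQL scans the whole table, and if it does, try a composite index covering the filter columns. The index name idx_t1_filter and the column order below are illustrative guesses, not something taken from the actual schema, and whether the optimizer uses the index depends on the data distribution.

```sql
-- Show the execution plan; a "type" of ALL with ~2.5M rows suggests a full table scan.
EXPLAIN
SELECT t1.tp_id
FROM table_1 t1
WHERE t1.value_2 = 2
  AND (t1.value_1 = 1 OR t1.value_1 = 2)
  AND t1.value_3 = 1
  AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY t1.tp_id
HAVING COUNT(*) = 2;

-- Hypothetical composite index: equality columns first, then the range column,
-- then tp_id so the query can be answered from the index alone.
CREATE INDEX idx_t1_filter
    ON table_1 (value_2, value_3, value_1, date, tp_id);
```

Re-running the EXPLAIN after creating the index shows whether the plan actually changed.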