
I am using MySQL 5.7.25 and this is the query I am trying to optimize:

SELECT a.contract,
       a.phone_number_1,
       a.phone_number_2,
       a.phone_number_3,
       a.phone_number_4,
       a.phone_number_5
  FROM tempdb.customer_crm a
 WHERE CHAR_LENGTH(a.contract) = 12
   AND (
         a.contract in (SELECT contract_final FROM tempdb.relevant_contracts)
         OR a.phone_number_1 in (SELECT phone_number FROM tempdb.relevant_numbers_1)
         OR a.phone_number_2 in (SELECT phone_number FROM tempdb.relevant_numbers_2)
         OR a.phone_number_3 in (SELECT phone_number FROM tempdb.relevant_numbers_3)
         OR a.phone_number_4 in (SELECT phone_number FROM tempdb.relevant_numbers_4)
         OR a.phone_number_5 in (SELECT phone_number FROM tempdb.relevant_numbers_5)
       );

The customer_crm table has 5 phone numbers in 5 separate columns. I need to return all records where any of the 5 phone numbers exists in the table relevant_numbers. I made 5 copies of relevant_numbers because I can only use TEMPORARY tables, and MySQL cannot refer to the same temporary table more than once in a single query. The number of records in each:

  • customer_crm: 80 Million
  • relevant_numbers: 63 Thousand
  • relevant_contracts: 93 Thousand
  • Result of the query: 100 Thousand

This query takes too long. I shaved off a few minutes by adding a length condition on each phone number:

SELECT a.contract,
       a.phone_number_1,
       a.phone_number_2,
       a.phone_number_3,
       a.phone_number_4,
       a.phone_number_5
  FROM tempdb.customer_crm a
 WHERE CHAR_LENGTH(a.contract) = 12
   AND (
         a.contract in (SELECT contract_final FROM tempdb.relevant_contracts)
         OR (CHAR_LENGTH(a.phone_number_1) > 9 AND a.phone_number_1 in (SELECT phone_number FROM tempdb.relevant_numbers_1))
         OR (CHAR_LENGTH(a.phone_number_2) > 9 AND a.phone_number_2 in (SELECT phone_number FROM tempdb.relevant_numbers_2))
         OR (CHAR_LENGTH(a.phone_number_3) > 9 AND a.phone_number_3 in (SELECT phone_number FROM tempdb.relevant_numbers_3))
         OR (CHAR_LENGTH(a.phone_number_4) > 9 AND a.phone_number_4 in (SELECT phone_number FROM tempdb.relevant_numbers_4))
         OR (CHAR_LENGTH(a.phone_number_5) > 9 AND a.phone_number_5 in (SELECT phone_number FROM tempdb.relevant_numbers_5))
       );

It still takes about 10 minutes. I have tried using EXISTS condition instead of IN and it takes even longer. I have also tried using left join which also takes longer. All the columns are individually indexed.

Any help will be appreciated. Thanks.

  • Why can you only use temporary tables? Why don't you use joins? Which indexes exist on the tables? Commented Jan 25, 2020 at 21:13

2 Answers


OR is a performance killer. So is IN ( SELECT ... ).

The query as it stands is going to do a full table scan of 80M rows and, for each row, do lookups into the temp tables. Each of those secondary lookups touches only 1 row if you go to the effort of indexing your temp tables, or 63K rows otherwise. The latter adds up to about 25 trillion row checks (80M rows × 63K rows × 5 phone columns). It might finish this year.
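
Indexing the temp tables might look like the sketch below. It assumes the table and column names from the question and that the tables are already populated. EXPLAIN is one way to check whether the lookups actually use the new indexes.

```sql
-- Assumes the temporary tables from the question already exist.
ALTER TABLE tempdb.relevant_contracts ADD INDEX (contract_final);
ALTER TABLE tempdb.relevant_numbers_1 ADD INDEX (phone_number);
ALTER TABLE tempdb.relevant_numbers_2 ADD INDEX (phone_number);
ALTER TABLE tempdb.relevant_numbers_3 ADD INDEX (phone_number);
ALTER TABLE tempdb.relevant_numbers_4 ADD INDEX (phone_number);
ALTER TABLE tempdb.relevant_numbers_5 ADD INDEX (phone_number);

-- Inspect the plan for one branch of the original query.
EXPLAIN
SELECT a.contract
  FROM tempdb.customer_crm AS a
 WHERE a.phone_number_1 IN (SELECT phone_number FROM tempdb.relevant_numbers_1);
```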

Plan A: Turn OR into UNION:

    (  SELECT  cc.id
            FROM  tempdb.customer_crm AS cc
            JOIN  tempdb.relevant_contracts AS rc
              ON  cc.contract = rc.contract_final
    )  UNION
    (  SELECT  cc.id
            FROM  tempdb.customer_crm AS cc
            JOIN  tempdb.relevant_numbers_1 AS rn
              ON  cc.phone_number_1 = rn.phone_number
    )  UNION
    (  SELECT  cc.id
            FROM  tempdb.customer_crm AS cc
            JOIN  tempdb.relevant_numbers_2 AS rn
              ON  cc.phone_number_2 = rn.phone_number
    )  UNION
    (  SELECT  cc.id
            FROM  tempdb.customer_crm AS cc
            JOIN  tempdb.relevant_numbers_3 AS rn
              ON  cc.phone_number_3 = rn.phone_number
    )  UNION
    (  SELECT  cc.id
            FROM  tempdb.customer_crm AS cc
            JOIN  tempdb.relevant_numbers_4 AS rn
              ON  cc.phone_number_4 = rn.phone_number
    )  UNION
    (  SELECT  cc.id
            FROM  tempdb.customer_crm AS cc
            JOIN  tempdb.relevant_numbers_5 AS rn
              ON  cc.phone_number_5 = rn.phone_number
    )

I am assuming that id is the PRIMARY KEY of customer_crm. You will need these indexes on customer_crm:

INDEX(contract, id)
INDEX(phone_number_1, id)
INDEX(phone_number_2, id)
INDEX(phone_number_3, id)
INDEX(phone_number_4, id)
INDEX(phone_number_5, id)

Use the above query as a subquery, JOIN that back to customer_crm to get whatever columns you really need.
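
A minimal sketch of that join-back, assuming as above that id is the primary key of customer_crm and that the contract column in relevant_contracts is contract_final, as in the question. The length check is applied in the outer query, where only the matched rows remain:

```sql
SELECT  cc.contract,
        cc.phone_number_1,
        cc.phone_number_2,
        cc.phone_number_3,
        cc.phone_number_4,
        cc.phone_number_5
  FROM  (
          -- UNION (not UNION ALL) removes ids matched by more than one branch.
          SELECT c.id FROM tempdb.customer_crm AS c
            JOIN tempdb.relevant_contracts AS rc ON c.contract = rc.contract_final
          UNION
          SELECT c.id FROM tempdb.customer_crm AS c
            JOIN tempdb.relevant_numbers_1 AS rn ON c.phone_number_1 = rn.phone_number
          UNION
          SELECT c.id FROM tempdb.customer_crm AS c
            JOIN tempdb.relevant_numbers_2 AS rn ON c.phone_number_2 = rn.phone_number
          UNION
          SELECT c.id FROM tempdb.customer_crm AS c
            JOIN tempdb.relevant_numbers_3 AS rn ON c.phone_number_3 = rn.phone_number
          UNION
          SELECT c.id FROM tempdb.customer_crm AS c
            JOIN tempdb.relevant_numbers_4 AS rn ON c.phone_number_4 = rn.phone_number
          UNION
          SELECT c.id FROM tempdb.customer_crm AS c
            JOIN tempdb.relevant_numbers_5 AS rn ON c.phone_number_5 = rn.phone_number
        ) AS m
  JOIN  tempdb.customer_crm AS cc ON cc.id = m.id   -- id assumed to be the PRIMARY KEY
 WHERE  CHAR_LENGTH(cc.contract) = 12;              -- filter applied to matched rows only
```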

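That will be on the order of 1 million actions (roughly 93K + 5 × 63K ≈ 408K index probes, plus fetching the matched rows, provided the optimizer drives each join from the small table) -- much less.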

The check for length=12 can come later as a minor annoyance.

Plan B: Don't use 5 columns.

It is usually a bad schema design to have an array of things spread across multiple columns or packed together in a single column. Instead, have another table with (at least) 2 columns: the number and the id to join back to the main table.

With INDEX(number), it won't matter that it has 5*80M rows.
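
One possible shape for such a table is sketched below. The table name customer_phone, the column names, and the INT UNSIGNED id type are hypothetical; VARCHAR(20) follows the column type the asker reports for the phone numbers. Because the lookup table is referenced only once, a single copy of relevant_numbers would be enough.

```sql
-- Hypothetical child table: one row per (customer, phone number).
CREATE TABLE customer_phone (
    customer_id  INT UNSIGNED NOT NULL,   -- assumed to match customer_crm.id
    phone_number VARCHAR(20)  NOT NULL,
    PRIMARY KEY (customer_id, phone_number),
    INDEX (phone_number)                  -- InnoDB appends the PK columns to this index
);

-- Find every customer that has at least one relevant number.
SELECT DISTINCT cp.customer_id
  FROM tempdb.relevant_numbers AS rn
  JOIN customer_phone AS cp ON cp.phone_number = rn.phone_number;
```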

Plan C: Would you care to back up to before you created the temp tables? Other optimizations may be possible.


1 Comment

Thank you Rick. Your Plan A did wonders. Query execution time reduced about 6 times. As this is client db I cannot implement Plan B or C. I am accepting it as solution.
2

customer_crm table has 5 different phone numbers in 5 columns. I need to filter all the records where any of the 5 phone numbers exists in table relevant_numbers.

Instead of checking individually each phone number in relevant_numbers, why not use exists with an in condition?

select c.*
from tempdb.customer_crm c
where 
    exists (
        select 1
        from tempdb.relevant_contracts o
        where o.contract_final = c.contract 
    )
    or exists (
        select 1
        from tempdb.relevant_numbers n
        where n.phone_number in (
            c.phone_number_1,
            c.phone_number_2,
            c.phone_number_3,
            c.phone_number_4,
            c.phone_number_5
        )
    )

For performance, you can try the following indexes:

customer_crm(
    contract, 
    phone_number_1,
    phone_number_2,
    phone_number_3,
    phone_number_4,
    phone_number_5
)
relevant_contracts(contract_final)
relevant_numbers (phone_number)

I am also unsure that the check on the length of contract is beneficial: wrapping the column in a function makes the condition non-sargable (i.e., it prevents the use of an index on that column).

4 Comments

contract length is a requirement. i added the optional length checks on phone numbers and it seemed to reduce time. let me try your suggestions and get back. Also, it should be "OR EXISTS" instead of AND. Right?
@Imtiaz: what is the datatype of column contract: string or numeric?
contract is varchar(20), so are all the phone numbers. I can't change the original tables as they are customer's. I can however change types in my temporary tables.
Thanks @GMB for your help. I tried your suggestions but the query execution time remained in the same ballpark.
