I have a table (civicrm_contact) with the following (relevant) columns:
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| contact_type | varchar(64) | YES | MUL | NULL | |
| first_name | varchar(64) | YES | | NULL | |
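| last_name | varchar(64) | YES | MUL | NULL | |
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+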
I'm trying to optimize a query that does a partial-match comparison on the name fields: the first 4 characters of first_name and the first 6 characters of last_name.
SELECT t1.id id1, t2.id id2, 2 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
ON (SUBSTR(t1.first_name, 1, 4) = SUBSTR(t2.first_name, 1, 4))
AND (SUBSTR(t1.last_name, 1, 6) = SUBSTR(t2.last_name, 1, 6))
WHERE t1.contact_type = 'Individual'
AND t2.contact_type = 'Individual'
AND t1.first_name IS NOT NULL
AND t1.first_name <> ''
AND t1.last_name IS NOT NULL
AND t1.last_name <> ''
AND t1.id < t2.id
I created an index whose prefix lengths match the partial comparisons. There are other existing indexes as well.
CREATE INDEX idx_ct_last6_first4_name ON civicrm_contact (contact_type, last_name(6), first_name(4));
However, the query remains very slow. When I EXPLAIN it, I can see that the first table (t1) uses index_contact_type rather than the new index, and the second table (t2) does not use any index at all (type ALL, with a block nested loop join buffer).
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ref | PRIMARY,index_contact_type,index_first_name,index_last_name,dedupe_index_first_name_4,dedupe_index_last_name_6,idx_last6_first4_name,idx_ct_last6_first4_name | index_contact_type | 195 | const | 46968 | 25.00 | Using where |
| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY,index_contact_type,idx_ct_last6_first4_name | NULL | NULL | NULL | 93936 | 16.66 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
I'm trying to understand why the query isn't making better use of the indexes and what I can do to optimize it. It is intended to compare the table against itself to identify likely duplicate contacts. Is the lack of index use because I'm joining the table against itself?
If so, I'm guessing my best option is to create a temp table for the second table reference so that I am joining two separate tables, which should make better use of the indexes.
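For reference, this is roughly what I have in mind for the temp-table approach. It is an untested sketch: the table name tmp_contact_prefix and the column names first4 and last6 are made up for illustration, and I'm assuming that storing the prefixes as real columns would let the join use an ordinary index on them.

```sql
-- Hypothetical helper table holding the precomputed name prefixes.
-- Built with CREATE ... SELECT so the columns inherit the collation
-- of civicrm_contact.first_name / last_name.
CREATE TEMPORARY TABLE tmp_contact_prefix (
  PRIMARY KEY (id),
  KEY idx_last6_first4 (last6, first4)
) AS
SELECT id,
       SUBSTR(first_name, 1, 4) AS first4,
       SUBSTR(last_name, 1, 6)  AS last6
FROM civicrm_contact
WHERE contact_type = 'Individual'
  AND first_name IS NOT NULL
  AND first_name <> ''
  AND last_name IS NOT NULL
  AND last_name <> '';

-- Same comparison as the original query, but t2 is now the helper table,
-- so the prefix match is an equality lookup on its (last6, first4) index.
SELECT t1.id id1, t2.id id2, 2 weight
FROM civicrm_contact t1
JOIN tmp_contact_prefix t2
  ON t2.last6  = SUBSTR(t1.last_name, 1, 6)
 AND t2.first4 = SUBSTR(t1.first_name, 1, 4)
WHERE t1.contact_type = 'Individual'
  AND t1.first_name IS NOT NULL
  AND t1.first_name <> ''
  AND t1.last_name IS NOT NULL
  AND t1.last_name <> ''
  AND t1.id < t2.id;
```

The temporary table is referenced only once in the SELECT, since MySQL does not allow the same TEMPORARY table to appear more than once in a single query.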