7

I need to select some data from a MySQL DB using PHP. It can be done within one single MySQL query, which takes 5 minutes to run on a good server (multiple JOINs on tables with more than 10 million rows).

I was wondering whether it is better practice to split the query into several smaller ones and loop over them in PHP, rather than doing everything in MySQL. Also, would it be better to load all the emails from one table with 150,000 rows into a PHP array and check against that array, instead of doing thousands of MySQL SELECTs?

Here is the Query:

SELECT count(contacted_emails.id), contacted_emails.email 
FROM contacted_emails
LEFT OUTER JOIN blacklist ON contacted_emails.email = blacklist.email
LEFT OUTER JOIN submission_authors ON contacted_emails.email = submission_authors.email
LEFT OUTER JOIN users ON contacted_emails.email = users.email
GROUP BY contacted_emails.email
HAVING count(contacted_emails.id) > 3

The EXPLAIN output was attached as a screenshot.

The indexes in the 4 tables are:

contacted_emails: id, blacklist_section_id, journal_id and email
blacklist: id, email and name
submission_authors: id, hash_key and email
users: id, email, firstname, lastname, editor_id, title_id, country_id, workplace_id, jobtype_id

The table contacted_emails is created like:

CREATE TABLE contacted_emails ( 
  id int(10) unsigned NOT NULL AUTO_INCREMENT, 
  email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  contacted_at datetime NOT NULL, 
  created_at datetime NOT NULL, 
  blacklist_section_id int(11) unsigned NOT NULL,
  journal_id int(10) DEFAULT NULL, 
  PRIMARY KEY (id), 
  KEY blacklist_section_id (blacklist_section_id), 
  KEY journal_id (journal_id), 
  KEY email (email) ) 
ENGINE=InnoDB AUTO_INCREMENT=4491706 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
  • 4
    As a general rule, SQL will ALWAYS be faster than PHP. If your query is taking 5 minutes, even with millions of records and multiple joins, I'm betting there's either sub-optimal syntax or a missing index somewhere. You should do an EXPLAIN to check your query's execution plan for further optimization. Commented Aug 21, 2015 at 15:04
  • 1
    You should repost a more specific question showing your query and EXPLAIN output and see if someone can fix it. Commented Aug 21, 2015 at 15:06
  • 1
    Please show the output of EXPLAIN SELECT count(contacted_emails.id)... and SHOW INDEXES IN contacted_emails and SHOW INDEXES IN blacklist and SHOW INDEXES IN submission_authors and SHOW INDEXES IN users Commented Aug 21, 2015 at 15:10
  • 1
    Generally, GROUP BY slows down the query (because it can create a temporary table on the filesystem). So sometimes it is better to do the "group by" work in PHP (generally when there are fewer than 50 rows to group, and you must be sure to have enough memory). Commented Aug 21, 2015 at 15:19
  • 1
    Edit your question; comments are hard to read. Commented Aug 21, 2015 at 15:24

4 Answers

3

Your indexes look fine.

The performance problems seem to come from the fact that you're JOINing all rows, then filtering using HAVING.

This would probably work better instead:

SELECT * 
FROM (
    SELECT email, COUNT(id) AS number_of_contacts
    FROM contacted_emails
    GROUP BY email
    HAVING COUNT(id) > 3
) AS ce
LEFT OUTER JOIN blacklist AS bl ON ce.email = bl.email
LEFT OUTER JOIN submission_authors AS sa ON ce.email = sa.email
LEFT OUTER JOIN users AS u ON ce.email = u.email
/* EDIT: Exclude-join clause added based on comments below */
WHERE bl.email IS NULL
    AND sa.email IS NULL
    AND u.email IS NULL

Here you're limiting your initial GROUPed data set before the JOINs, which is significantly more efficient.

Although, given the context of your original query, the LEFT OUTER JOIN tables don't seem to be used at all, so the query below would probably return the exact same results with even less overhead:

SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING count(id) > 3

What exactly is the point of those JOINed tables? The LEFT JOIN prevents them from filtering out any data, and you're only looking at the aggregate data from contacted_emails. Did you mean to use INNER JOIN instead?


EDIT: You mentioned that the point of the joins is to exclude emails in your existing tables. I modified my first query to do a proper exclude join (this was a bug in your originally posted code).

Here's another possible option that may perform well for you:

SELECT contacted_emails.email, COUNT(contacted_emails.id) AS number_of_contacts
FROM contacted_emails
LEFT JOIN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
) AS existing ON contacted_emails.email = existing.email
WHERE existing.email IS NULL
GROUP BY contacted_emails.email
HAVING COUNT(id) > 3

What I'm doing here is gathering the existing emails in a subquery and doing a single exclude join on that derived table.

Another way you may try to express this is as a non-correlated subquery in the WHERE clause:

SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
WHERE email NOT IN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
)
GROUP BY email
HAVING COUNT(id) > 3

Try them all and see which gives the best execution plan in MySQL.
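
For example, you can prefix each candidate with EXPLAIN and compare the plans (a minimal sketch using the NOT IN variant above; the exact output depends on your MySQL version):

EXPLAIN
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
WHERE email NOT IN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
)
GROUP BY email
HAVING COUNT(id) > 3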


2 Comments

Hi Steven, thank you for your answer. The LEFT OUTER JOIN is used to exclude emails that are already in the tables USERS, submission_authors and blacklist. I need those emails to be excluded.
@Miloš - In that case, you should use an IS NULL filter to exclude. Editing my answer.
2

A couple of thoughts. In terms of the query, you may find it faster if you use

count(*) row_count 

and change the HAVING to

row_count > 3

as this can be satisfied from the contacted_emails.email index without having to access the row to get contacted_emails.id. Since both fields are NOT NULL and contacted_emails is the base table, this should give the same result.
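
Putting those two fragments together, the whole query would look something like this (a sketch assembled from the suggestion above; row_count is just the alias used in this answer):

SELECT email, COUNT(*) AS row_count
FROM contacted_emails
GROUP BY email
HAVING row_count > 3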

As this query will only get slower as you collect more data, I would suggest a summary table where you store the counts (possibly per some time unit). This can either be updated periodically with a cronjob or on the fly with triggers and/or application logic.

If you use a per-time-unit option on created_at and/or store the timestamp of the last cron update, you should be able to get live results by pulling in and appending the latest data.

Any cache solution would have to be adjusted anyway to stay live, with the full query re-run every time the data is cleared/updated.
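
A rough sketch of such a summary table and its periodic refresh (the table name, column names, and the @last_run variable below are purely illustrative, not anything defined in the question):

-- Summary table: one row per email per day
CREATE TABLE contacted_emails_daily_counts (
  email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  count_date date NOT NULL,
  contact_count int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (email, count_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- Run from a cronjob; @last_run is a placeholder for the timestamp saved after the previous run
-- (boundary handling between runs is simplified here)
SET @last_run = '2015-08-21 00:00:00';

INSERT INTO contacted_emails_daily_counts (email, count_date, contact_count)
SELECT email, DATE(created_at), COUNT(*)
FROM contacted_emails
WHERE created_at >= @last_run
GROUP BY email, DATE(created_at)
ON DUPLICATE KEY UPDATE contact_count = contact_count + VALUES(contact_count);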

As suggested in the comments, the database is built for aggregating large amounts of data; PHP isn't.

2 Comments

If you count email with HAVING, you need to use DISTINCT which is quite slow.
@Mihai Yep, not sure you are entirely correct about DISTINCT, but I misread the grouping; I'll take out that suggestion.
2

You would probably be best off with a summary table that is updated via a trigger on every insert into your contacted_emails table. This summary table should have the email address and a count column; every insert into the contacted table updates the count. Have an index on the count column of the summary table. Then you can query directly from THAT to get the email accounts in question, and THEN join to get whatever other details need to be pulled.
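
A minimal sketch of that idea, assuming the contacted_emails table from the question (the summary table, column, and trigger names here are only illustrative):

CREATE TABLE contacted_email_counts (
  email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  contact_count int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (email),
  KEY contact_count (contact_count)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- Keep one running count per email address on every insert
CREATE TRIGGER contacted_emails_after_insert
AFTER INSERT ON contacted_emails
FOR EACH ROW
  INSERT INTO contacted_email_counts (email, contact_count)
  VALUES (NEW.email, 1)
  ON DUPLICATE KEY UPDATE contact_count = contact_count + 1;

-- The "contacted more than 3 times" list then becomes a cheap indexed lookup:
SELECT email, contact_count
FROM contacted_email_counts
WHERE contact_count > 3;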

2 Comments

That's not a reasonable solution. If we had to create "count" tables every time we needed to aggregate data, our jobs as programmers would suck. Counts would get out of sync. Marketing would decide they want averages, or counts per month, or whatever, and then we'd have to redo all the hacked-up programming again. That's why SQL exists - to do these complex jobs on the fly, so we don't NEED lists of aggregated figures.
@StevenMoseley, I respectfully disagree. In some cases it depends on the context of the sites in question... or even on data-mining in general. If triggers are put in place to update whatever aggregates, roll-ups, etc., querying from those as a basis WOULD be faster. The table is created ONCE, and the triggers on the OTHER table do the insert/update for you. Once the primary criteria are established, drilling into details gets you to the more raw data.
0

Following your recommendations, I chose this solution:

SELECT ce.email, ce.number_of_contacts
FROM (
    SELECT email, COUNT(id) AS number_of_contacts
    FROM contacted_emails
    GROUP BY email
    HAVING number_of_contacts > 3
) AS ce
NATURAL LEFT JOIN blacklist AS bl
NATURAL LEFT JOIN submission_authors AS sa
NATURAL LEFT JOIN users AS u
WHERE bl.email IS NULL AND sa.email IS NULL AND u.email IS NULL

This takes 10 seconds to run, which is fine for the moment. Once I have more data in the database, I will need to think about another solution where I create a temporary table.

So, to conclude, loading an entire table into a PHP array does not perform as well as making MySQL queries.

2 Comments

Did you try changing COUNT(id) to COUNT(*)? I'd be interested to know if it increased performance. Also, as you have already done the count, you can use HAVING number_of_contacts > 3 in the subquery.
@Arth, changing the COUNT(id) to COUNT(*) has no impact on the performance. However, changing HAVING COUNT(id) > 3 to HAVING number_of_contacts > 3 improved the performance (from 20sec to 10sec). I edited the answer, thanks a lot.
