
I have a database with over 30,000 tables, each holding roughly 40-100 rows. I want to retrieve a list of the names of all tables that contain a given string in a specific column.

So for example:

I want to retrieve the names of all tables which contain 'foo'...

Database
    Table_1
        ID: 1, STR: bar
        ID: 2, STR: foo
        ID: 3, STR: bar
    Table_2
        ID: 1, STR: bar
        ID: 2, STR: bar
        ID: 3, STR: bar
    Table_3
        ID: 1, STR: bar
        ID: 2, STR: bar
        ID: 3, STR: foo

So in this case the function should return ['Table_1', 'Table_3']

So far I have this; it works fine, but takes over two minutes to execute, which is far too long for the application I have in mind.

results = []  # was never initialized in my first attempt
self.m('SHOW TABLES')
result = self.db.store_result()
tablelist = result.fetch_row(0, 1)
for table in tablelist:
    table_name = table['Tables_in_definitions']
    # search_str is the string I'm looking for (interpolated directly,
    # so it must not contain quotes)
    self.m("SELECT `def` FROM `" + table_name + "` WHERE `def` = '" + search_str + "'")
    result = self.db.store_result()
    r = result.fetch_row(1, 1)
    if len(r) > 0:
        results.append(table_name)

I'm not smart enough to come up with a way to speed this up so if anyone has any suggestions it would be greatly appreciated, thanks!

1 Answer


If you are just testing for the existence of one row in each table where def = 'str', one easy thing to do (with no other changes) is to add a LIMIT 1 clause to the end of your query.

(If your query is performing a full table scan, MySQL can halt it once the first row is found. If no rows are found, the full table scan has to run to the end of the table.)

This also avoids the overhead of preparing lots of rows and returning them to the client when they aren't needed.
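Applied to the per-table probe in the question (table and column names taken from the question's example), the query would become:

```sql
SELECT `def` FROM `Table_1` WHERE `def` = 'foo' LIMIT 1;
```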

Also, an index with def as a leading column (at least on your largest tables) will likely help performance, if your query is looking through large tables for "a needle in a haystack".
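For example, such an index could be added like this (the index name idx_def is just illustrative):

```sql
ALTER TABLE Table_1 ADD INDEX idx_def (def);
```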


UPDATE:

I've re-read your question, and I see that you have 30,000 tables to check, that's 30,000 separate queries, 30,000 roundtrips to the database. (ACCCKKK.)

So my previous suggestion is pretty much useless. (That would be more appropriate with 40 tables each having 30,000 rows.)

Another approach would be to query a bunch of those tables at the same time. I'd be hesitant to even try more than a couple hundred tables at a time though, so I'd do it in batches.

SELECT DISTINCT 'Table1' AS table_name FROM Table1 WHERE def = 'str'
 UNION ALL
SELECT DISTINCT 'Table2' FROM Table2 WHERE def = 'str'
 UNION ALL
SELECT DISTINCT 'Table3' FROM Table3 WHERE def = 'str'

If def is unique in each table, or, if it's nearly unique, and you can handle duplicate table_name values being returned, you could get rid of the DISTINCT keyword.

You do need to ensure that every table in the list has a column named def; if even one table in a batch lacks that column, the whole batched query fails. SHOW TABLES doesn't check column names, so I'd use a query like this to get the list of table names that have a column named def:

SELECT table_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND column_name = 'def'
 GROUP BY table_name
 ORDER BY table_name
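Putting the batching idea into the asker's Python, a sketch might look like this. It assumes every name in tables has a def column (e.g. taken from the information_schema query above); build_batch_query, batches, and batch_size are illustrative names, not part of any library:

```python
# Sketch: combine many per-table probes into one UNION ALL query,
# run in batches to cut down on database roundtrips.

def build_batch_query(tables, value):
    # One SELECT per table, each tagged with the table's own name,
    # glued together with UNION ALL as in the answer above.
    # NOTE: value is interpolated directly; escape it if it can
    # contain quotes.
    parts = ["SELECT DISTINCT '%s' AS table_name FROM `%s` WHERE `def` = '%s'"
             % (t, t, value) for t in tables]
    return "\n UNION ALL\n".join(parts)

def batches(items, batch_size=150):
    # Yield successive slices of at most batch_size items.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Usage would then be roughly: for each chunk from batches(all_tables), run build_batch_query(chunk, 'foo') and collect the returned table_name values into the results list.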

6 Comments

Thanks for your comment, 'def' is unique so like you said, I'm only checking for 1 row in each table. I'll see if the performance is improved with LIMIT 1 and making def a leading column.
I've updated my answer... I read your question more carefully, and I don't think my first suggestion (using LIMIT 1) is going to help much... you aren't spending time scanning the tables, you're likely spending most of your time making 30,000+ roundtrips to the database to run a query that runs fast. A better approach would be to query multiple tables at the same time, using a UNION ALL approach, and having the query return the table_name the row was found in.
Okay great, thank you very much for the suggestion I'll try the UNION ALL approach and see what happens. The table names are all unique and every table has a def column, so I think SHOW TABLES should suffice.
@Arran: even if you do only ten tables at a time, that will cut the number of roundtrips to and from the database tenfold. I expect you may be able to do substantially more tables than that in a batch. You might even be able to get all 30,000+ tables in one shot, but I shudder; there's some internal limitation you're likely to hit, beyond the max_allowed_packet size. (I'm pretty sure I've never run a MySQL query that referenced more than 100 tables.)
Okay, I tried it with a few different numbers of tables in each query and benchmarked each, in case you're interested, here are my results: 100 - ~1s, 200 - ~1s, 300 - ~2s, 400 - 58s, 900 - 1:10s The higher numbers just seem to bottleneck it. I've gone with 150 as it seems about the fastest. I can't thank you enough for your help!
