
I have a dictionary in my database with over a million records, and this simple select

select * from Word where languageId = 'en' order by rand() limit 1

randomly selects one word.

The problem is that this query takes 3-4 seconds, which is far too long because I have to repeat it many times.

Is there a way to achieve the same thing but much faster?

EDIT - table schema

wordId - integer, auto increment
languageId - varchar (FK), values like cs, en, de, ... 
word - varchar, word itself

Data structure example

wordId   languageId   word
--------------------------
1        cs           abatyše
...
100000   cs           zip
100001   en           aardvark
...
etc

SQL

CREATE TABLE Language (
  languageId VARCHAR(20) NOT NULL,
  name VARCHAR(255) NULL,
  PRIMARY KEY (languageId));

CREATE TABLE Word (
  wordId INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  languageId VARCHAR(20) NOT NULL,
  word VARCHAR(255) NULL,
  PRIMARY KEY (wordId),
  INDEX Word_FK_Language (languageId),
  FOREIGN KEY (languageId)
    REFERENCES Language(languageId)
      ON DELETE NO ACTION
      ON UPDATE NO ACTION);
  • Do you have records ids column? Commented Feb 9, 2012 at 21:30
  • Could you please tell more about what you are trying to do? May be you do not need to perform single queries or your solution could be optimized. Commented Feb 9, 2012 at 21:45
  • @Cheery I want to select one random word and restricted by language - the restriction is essential. Commented Feb 9, 2012 at 22:01
  • why is languageId a VARCHAR(20)? shouldn't it be an enum or, better yet, an int? Commented Feb 9, 2012 at 22:07
  • @CAFxF Because it's the international code (en.wikipedia.org/wiki/ISO_639-1). I didn't want to have an int id and then the code, which is also unique. Commented Feb 9, 2012 at 22:16

3 Answers


If you have an ID column and the gaps between its values are not huge (i.e. not too many rows were deleted; otherwise some rows will be selected more often than others), then try this query:

SELECT * FROM `table` 
   WHERE id >= 
      (SELECT FLOOR( MAX(id) * RAND()) FROM `table` WHERE languageId = 'en' ) 
   AND languageId = 'en'
   ORDER BY id LIMIT 1;

Also look at the different examples here: http://akinas.com/pages/en/blog/mysql_random_row/

PS: I just realized that this works well only without the languageId restriction; otherwise the gaps in IDs for a single languageId could be huge.
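The id-range trick above (pick a random point inside the id range, then take the first row at or above it) can be sketched outside MySQL too. A minimal illustration using Python's sqlite3 with made-up data; table and column names follow the question's schema, and MySQL's RAND() is replaced by Python's random module:

```python
import random
import sqlite3

# In-memory stand-in for the Word table (data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO Word (wordId, languageId, word) VALUES (?, ?, ?)",
    [(i, "en", f"word{i}") for i in range(1, 1001)],
)

# Pick a random point inside the id range, then grab the first row at or
# above it: an index seek on the primary key instead of sorting the table.
max_id = conn.execute("SELECT MAX(wordId) FROM Word WHERE languageId = 'en'").fetchone()[0]
target = random.randint(1, max_id)
row = conn.execute(
    "SELECT * FROM Word WHERE wordId >= ? AND languageId = 'en' "
    "ORDER BY wordId LIMIT 1",
    (target,),
).fetchone()
print(row)
```

As the PS notes, this is only uniform when the ids for the chosen language are contiguous; deleted rows or interleaved languages skew the distribution toward rows that follow large gaps.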

Update: try this one, it could be a couple of times faster. I checked it against your query's execution time and it ran about twice as fast.

SELECT d.* FROM
  (SELECT @rn:=0 ) r, 
  (SELECT FLOOR(count(*)*RAND()) as rnd FROM `Word` WHERE languageId = 'en') t,  
  (SELECT @rn:=@rn+1 as rn, `Word`.* FROM `Word` WHERE languageId = 'en' ) d 
WHERE d.rn >= t.rnd LIMIT 1

Basically it still creates a kind of contiguous numbering, but without sorting by it.
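The user-variable numbering is essentially emulating a random OFFSET into the language's rows. On engines that accept a parameterized OFFSET, the same idea can be sketched directly; a hedged Python/sqlite3 illustration with invented data:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?)",
    [(i, "en" if i % 2 else "cs", f"word{i}") for i in range(1, 101)],
)

# Count the rows for the language, pick a random offset into them, and
# fetch exactly one row at that offset: uniform regardless of id gaps.
n = conn.execute("SELECT COUNT(*) FROM Word WHERE languageId = 'en'").fetchone()[0]
offset = random.randrange(n)
row = conn.execute(
    "SELECT * FROM Word WHERE languageId = 'en' LIMIT 1 OFFSET ?",
    (offset,),
).fetchone()
print(row)
```

Note that, like the user-variable version, this still scans past `offset` rows to find the one it returns, so it is linear in the offset rather than a pure index seek.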

Last update: this one could be even faster (depending on the random number generated):

SELECT d.* FROM
  ( SELECT @rn:=@rn+1 as rn, w.*, t.rnd rnd FROM 
     (SELECT @rn:=0 ) r, 
     (SELECT FLOOR(count(*)*RAND()) rnd FROM `Word` WHERE languageId = 'en') t, 
     `Word` w 
   WHERE w.languageId = 'en' AND @rn<t.rnd
  ) d 
WHERE d.rn=d.rnd

17 Comments

This query takes almost the same time as mine
I was thinking of a similar solution. I think the primary keys must be contiguous, though; otherwise there is a possibility of not matching the random value. Also, I'm not sure, but I think some RDBMSs will evaluate the subquery for each record.
@Tomas It probably is evaluating the subquery for each record.
The important part I haven't mentioned yet is that the table contains more than one language. So one language has ids from 1 to 500,000, the next from 500,001 to 1,000,000, and so on (as an example).
Well, it seems that this query executes a little bit faster, but it looks like it returns only the first (sometimes the second) word of the chosen language. And it does not look very stable for a language with low ids.

Firstly, make sure your table is properly indexed. Does it have a primary key? Is languageId an index? Make sure it is.

Secondly, are you only interested in the word, and not things like languageId or other fields in the table? If so, you need this:

SELECT word FROM Word...

Wildcard SELECTs return everything, but you don't need to retrieve data you're never going to use.

Thirdly, if you're repeating the query many times, are you just running the same one in a loop? Change your LIMIT clause to return more words in one query:

-- for 10 words
... LIMIT 10

You can store this result for later use without having to re-query the database.
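The batch-and-cache idea can be sketched briefly; a hedged Python/sqlite3 illustration with invented data (sqlite spells the function RANDOM() rather than MySQL's RAND()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?)",
    [(i, "en", f"word{i}") for i in range(1, 101)],
)

# Fetch a batch of random words in one query, then serve them from a
# local cache instead of hitting the database on every iteration.
cache = [
    r[0]
    for r in conn.execute(
        "SELECT word FROM Word WHERE languageId = 'en' ORDER BY RANDOM() LIMIT 10"
    )
]

words = [cache.pop() for _ in range(10)]  # ten words, one database round trip
print(words)
```

One sort per ten (or hundred) words amortizes the expensive ORDER BY RANDOM() across the whole batch, which is exactly the saving the answer describes.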

Finally, you can run your query, but with EXPLAIN in front of it to get an overview of what MySQL does when you run it.

EXPLAIN SELECT word FROM Word...

Using that, you can identify where exactly your query is running slowly.

4 Comments

Good catch, I missed the part about repetition when I read the question. OP should definitely reuse the sorted list (it also prevents duplication).
Selecting only the word field is a good place to start. Unfortunately I can only select one word at a time in a loop. Otherwise I would have to create a cache in my source code: read more words at once, then read from the cache, and repeat the process. I also updated my question.
@Tomas Yes, you would need to read from a cached result. But, almost every database driver does this for you automatically. When you execute a query it should hold your result set until you release it.
@Tomas optimisations on the database side won't get you anywhere if your client side code is sub-optimal. The whole point of a cache is to do the intensive work at the start so you don't have to do it again later.

You could partition the table by the first letter of the word, randomly pick a letter, and then use your existing sort to pick a random word within that partition. Sorting ~50,000 rows should be reasonably fast on a modern server. Most database sorts are O(n log n), so 1/26th of the records should sort more than 50 times faster, and the partition select should be negligible in terms of performance. On the other hand, fuzzyDunlop's suggestion of reusing the same list will still win out after 50 or so executions. Edit: I think I messed up the log on the Windows calculator, so I'm going to go with: it should be more than 26 times faster ;)
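The two-step pick described above can be sketched as follows; a hedged Python/sqlite3 illustration with a tiny invented word list (sqlite's RANDOM() stands in for MySQL's RAND()):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
words = ["apple", "avocado", "banana", "berry", "cherry", "citrus"]
conn.executemany(
    "INSERT INTO Word (languageId, word) VALUES (?, ?)",
    [("en", w) for w in words],
)

# Step 1: pick a random first letter that actually occurs in the data
# (picking from a fixed a-z list could select an empty partition).
letters = [r[0] for r in conn.execute(
    "SELECT DISTINCT substr(word, 1, 1) FROM Word WHERE languageId = 'en'"
)]
letter = random.choice(letters)

# Step 2: sort only that slice of the table, roughly 1/26th of the rows.
row = conn.execute(
    "SELECT word FROM Word WHERE languageId = 'en' AND word LIKE ? "
    "ORDER BY RANDOM() LIMIT 1",
    (letter + "%",),
).fetchone()
print(row[0])
```

Note the result is only uniform overall if every letter holds a similar number of words; letters with few words get over-represented, which may or may not matter for this use case.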

