
I have a dictionary in my database with over a million records, and this simple select

select * from Word where languageId = 'en' order by rand() limit 1

randomly selects one word.

The problem is that this query takes 3-4 seconds, which is far too long because I have to repeat it many times.

Is there a way to achieve the same thing but much faster?

EDIT - table schema

wordId - integer, auto increment
languageId - varchar (FK), values like cs, en, de, ... 
word - varchar, word itself

Data structure example

wordId   languageId   word
--------------------------
1        cs           abatyše
...
100000   cs           zip
100001   en           aardvark
...
etc

SQL

CREATE TABLE Language (
  languageId VARCHAR(20) NOT NULL,
  name VARCHAR(255) NULL,
  PRIMARY KEY (languageId));

CREATE TABLE Word (
  wordId INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  languageId VARCHAR(20) NOT NULL,
  word VARCHAR(255) NULL,
  PRIMARY KEY (wordId),
  INDEX Word_FK_Language (languageId),
  FOREIGN KEY (languageId)
    REFERENCES Language(languageId)
      ON DELETE NO ACTION
      ON UPDATE NO ACTION);
  • Do you have records ids column? Commented Feb 9, 2012 at 21:30
  • Could you please tell more about what you are trying to do? May be you do not need to perform single queries or your solution could be optimized. Commented Feb 9, 2012 at 21:45
  • @Cheery I want to select one random word and restricted by language - the restriction is essential. Commented Feb 9, 2012 at 22:01
  • why is languageId a VARCHAR(20)? shouldn't it be an enum or, better yet, an int? Commented Feb 9, 2012 at 22:07
  • @CAFxF Because it's the international code (en.wikipedia.org/wiki/ISO_639-1). I didn't want to have an int id and then the code, which is also unique. Commented Feb 9, 2012 at 22:16

3 Answers


If you have an ID column and the gaps between its values are not huge (i.e. not too many rows were deleted; otherwise some rows will be selected more often than others), then try this query:

SELECT * FROM `table` 
   WHERE id >= 
      (SELECT FLOOR( MAX(id) * RAND()) FROM `table` WHERE languageId = 'en' ) 
   AND languageId = 'en'
   ORDER BY id LIMIT 1;

Also look at the different examples here: http://akinas.com/pages/en/blog/mysql_random_row/

PS: I just realized that this works well only without the languageId restriction; otherwise the gaps in IDs for a single languageId could be huge.
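The id-range trick above (pick a random point inside the id range, then take the first row at or above it) can be sketched outside MySQL too. A minimal illustration using Python's sqlite3 with made-up data; table and column names follow the question's schema, and MySQL's RAND() is replaced by Python's random module:

```python
import random
import sqlite3

# In-memory stand-in for the Word table (data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO Word (wordId, languageId, word) VALUES (?, ?, ?)",
    [(i, "en", f"word{i}") for i in range(1, 1001)],
)

# Pick a random point inside the id range, then grab the first row at or
# above it: an index seek on the primary key instead of sorting the table.
max_id = conn.execute("SELECT MAX(wordId) FROM Word WHERE languageId = 'en'").fetchone()[0]
target = random.randint(1, max_id)
row = conn.execute(
    "SELECT * FROM Word WHERE wordId >= ? AND languageId = 'en' "
    "ORDER BY wordId LIMIT 1",
    (target,),
).fetchone()
print(row)
```

As the PS notes, this is only uniform when the ids for the chosen language are contiguous; deleted rows or interleaved languages skew the distribution toward rows that follow large gaps.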

Update: try this one, it could be a couple of times faster. I checked it against your query's execution time and it ran about twice as fast.

SELECT d.* FROM
  (SELECT @rn:=0 ) r, 
  (SELECT FLOOR(count(*)*RAND()) as rnd FROM `Word` WHERE languageId = 'en') t,  
  (SELECT @rn:=@rn+1 as rn, `Word`.* FROM `Word` WHERE languageId = 'en' ) d 
WHERE d.rn >= t.rnd LIMIT 1

Basically it still creates a kind of contiguous numbering, but without sorting by it.
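The user-variable numbering is essentially emulating a random OFFSET into the language's rows. On engines that accept a parameterized OFFSET, the same idea can be sketched directly; a hedged Python/sqlite3 illustration with invented data:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?)",
    [(i, "en" if i % 2 else "cs", f"word{i}") for i in range(1, 101)],
)

# Count the rows for the language, pick a random offset into them, and
# fetch exactly one row at that offset: uniform regardless of id gaps.
n = conn.execute("SELECT COUNT(*) FROM Word WHERE languageId = 'en'").fetchone()[0]
offset = random.randrange(n)
row = conn.execute(
    "SELECT * FROM Word WHERE languageId = 'en' LIMIT 1 OFFSET ?",
    (offset,),
).fetchone()
print(row)
```

Note that, like the user-variable version, this still scans past `offset` rows to find the one it returns, so it is linear in the offset rather than a pure index seek.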

Last update: this one could be even faster (depending on the random number generated):

SELECT d.* FROM
  ( SELECT @rn:=@rn+1 as rn, w.*, t.rnd rnd FROM 
     (SELECT @rn:=0 ) r, 
     (SELECT FLOOR(count(*)*RAND()) rnd FROM `Word` WHERE languageId = 'en') t, 
     `Word` w 
   WHERE w.languageId = 'en' AND @rn<t.rnd
  ) d 
WHERE d.rn=d.rnd

17 Comments

This query takes almost the same time as mine
I was thinking of a similar solution. I think the primary keys must be contiguous, though; otherwise there is a possibility of not matching the random value. Also, I'm not sure, but I think some RDBMSs will evaluate the subquery for each record.
@Tomas It probably is evaluating the subquery for each record.
The important part I haven't mentioned yet is that the table contains more than one language. So one language has ids from 1 to 500,000, the next from 500,001 to 1,000,000, and so on (as an example).
Well, it seems that this query executes a little bit faster, but it looks like it returns only the first (sometimes the second) word of the chosen language. And it does not look very stable for a language with low ids.

Firstly, make sure your table is properly indexed. Does it have a primary key? Is languageId an index? Make sure it is.

Secondly, are you only interested in the word, and not things like languageId or other fields in the table? If so, you need this:

SELECT word FROM Word...

Wildcard SELECTs return everything, but you don't need to retrieve data you're never going to use.

Thirdly, if you're repeating the query many times, are you just running the same one in a loop? Change your LIMIT clause to return more words in one query:

-- for 10 words
... LIMIT 10

You can store this result for later use without having to re-query the database.
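The batch-and-cache idea can be sketched briefly; a hedged Python/sqlite3 illustration with invented data (sqlite spells the function RANDOM() rather than MySQL's RAND()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?)",
    [(i, "en", f"word{i}") for i in range(1, 101)],
)

# Fetch a batch of random words in one query, then serve them from a
# local cache instead of hitting the database on every iteration.
cache = [
    r[0]
    for r in conn.execute(
        "SELECT word FROM Word WHERE languageId = 'en' ORDER BY RANDOM() LIMIT 10"
    )
]

words = [cache.pop() for _ in range(10)]  # ten words, one database round trip
print(words)
```

One sort per ten (or hundred) words amortizes the expensive ORDER BY RANDOM() across the whole batch, which is exactly the saving the answer describes.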

Finally, you can run your query, but with EXPLAIN in front of it to get an overview of what MySQL does when you run it.

EXPLAIN SELECT word FROM Word...

Using that, you can identify where exactly your query is running slowly.

4 Comments

Good catch, I missed the part about repetition when I read the question. OP should definitely reuse the sorted list (it also prevents duplication).
Selecting only the word field is a good place to start. Unfortunately I can only select one word at a time in a loop. Otherwise I would have to create a cache in my source code: read more words at once, then read from the cache, and repeat the process. I also updated my question.
@Tomas Yes, you would need to read from a cached result. But, almost every database driver does this for you automatically. When you execute a query it should hold your result set until you release it.
@Tomas optimisations on the database side won't get you anywhere if your client side code is sub-optimal. The whole point of a cache is to do the intensive work at the start so you don't have to do it again later.

You could partition the table by the first letter of the word, randomly pick a letter, and then use your existing sort to pick a random word within that partition. Sorting ~50,000 rows should be reasonably fast on a modern server. Most database sorts are O(n log n), so 1/26th of the records should sort more than 50 times faster, and the partition select should be negligible in terms of performance. On the other hand, fuzzyDunlop's suggestion of reusing the same list will still win out after 50 or so executions. Edit: I think I messed up the log on the Windows calculator, so I'm going to go with: it should be more than 26 times faster ;)
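The two-step pick described above can be sketched as follows; a hedged Python/sqlite3 illustration with a tiny invented word list (sqlite's RANDOM() stands in for MySQL's RAND()):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Word (wordId INTEGER PRIMARY KEY, languageId TEXT, word TEXT)")
words = ["apple", "avocado", "banana", "berry", "cherry", "citrus"]
conn.executemany(
    "INSERT INTO Word (languageId, word) VALUES (?, ?)",
    [("en", w) for w in words],
)

# Step 1: pick a random first letter that actually occurs in the data
# (picking from a fixed a-z list could select an empty partition).
letters = [r[0] for r in conn.execute(
    "SELECT DISTINCT substr(word, 1, 1) FROM Word WHERE languageId = 'en'"
)]
letter = random.choice(letters)

# Step 2: sort only that slice of the table, roughly 1/26th of the rows.
row = conn.execute(
    "SELECT word FROM Word WHERE languageId = 'en' AND word LIKE ? "
    "ORDER BY RANDOM() LIMIT 1",
    (letter + "%",),
).fetchone()
print(row[0])
```

Note the result is only uniform overall if every letter holds a similar number of words; letters with few words get over-represented, which may or may not matter for this use case.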

