Getting random data from a MySQL database but not repeating data

Question

I have a list of random products(1000's) each with an ID and I am bringing them up randomly. I would like that the items are not repeated. Currently what I am doing is the following:

select * from products where product_id <> previous_product_id order by rand() limit 1;

I am ensuring that the same product will not appear directly after. A repeat product usually appears a lot sooner then I would like (I believe I am right in saying this is the birthday problem). I have no idea what is the most effective way to get random data in a non-repeating fashion. I have thought of a way that I assume is highly inefficent:

I would assign the user an id (e.g. foo) and then when they have seen an item add it to a string that would be product_id_1 AND product_id_2 AND product_id_3 AND product_id_n. I would store this data with timestamp(explained further on).

+--------------------------------------------------------------------------------------------+                                                                                          
|user_id |timestamp         | product_seen_string                                            |
|--------------------------------------------------------------------------------------------|
|foo     |01-01-14 12:00:00 |product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+--------------------------------------------------------------------------------------------+

With this product_seen_string I would keep adding to the seen products (I would also update the timestamp) and then in the query I would do a first query based on the user_id to obtain this string and then add that returned string to the main query that obtains the random products like so:

select * from products where product_id <> product_id_1 AND product_id_2 AND product_id_3 AND product_id_n order by rand() limit 1;

I would also write into that if no products were returned then the data would be deleted so that the cycle can start again. As well as having a cron job that every ten minutes would run to see if the timestamp is older then an hour I would delete it.

The scripting language I am using is PHP

I have added the scripting language I am using, however, I think this is a language agnostic issue. Whatever language I use does not matter and this is more a question of database design — bubblebath
– bubblebath, Commented Jun 4, 2014 at 20:11

Giacomo1968 · Accepted Answer · 2022-07-29 23:31:53Z

3

Selecting random rows is always tricky, and there are no perfect solutions that don't involve some compromise. Either compromise performance, or compromise even random distribution, or compromise on the chance of selecting duplicates, etc.

As @Giacomo1968 mentions in their answer, any solution with ORDER BY RAND() does not scale well. As the number of rows in your table gets larger, the cost of sorting the whole table in a filesort gets worse and worse. Giacomo1968 is correct that the query cannot be cached when the sort order is random. But I don't care about that so much because I usually disable the query cache anyway (it has its own scalability problems).

Here's a solution to pre-randomize the rows in the table, by creating a rownum column and assigning unique consecutive values:

ALTER TABLE products ADD COLUMN rownum INT UNSIGNED, ADD KEY (rownum);
SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();

Now you can get a random row by an index lookup, without sorting:

SELECT * FROM products WHERE rownum = 1;

Or you can get the next random row:

SELECT * FROM products WHERE rownum = 2;

Or you can get 10 random rows at a time, or any other number you want, with no duplicates:

SELECT * FROM products WHERE rownum BETWEEN 11 and 20;

You can re-randomize anytime you want:

SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();

It's still costly to do the random sorting, but now you don't have to do it on every SELECT query. You can do it on a schedule, hopefully at off-peak times.

edited Jul 29, 2022 at 23:31

Giacomo1968

25.3k11 gold badges78 silver badges106 bronze badges

answered Jun 4, 2014 at 21:10

Bill Karwin

567k87 gold badges709 silver badges869 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

bubblebath Over a year ago

@BillKarwin. Would it be still just as fast to write 'WHERE rand_id > prev_id' ? As sometimes their way be some filtering so my list of products may be cut to a smaller group

Bill Karwin Over a year ago

The point is to search by rownum not by id. Then you just search for rownum 1, then 2, then 3, incrementing in your app each time. Since the rownums are randomized beforehand, you get a new random entry.

bubblebath Over a year ago

Sorry, my mistake in my description I should have written 'WHERE rand_row_num > prev_rand_row_num'

Bill Karwin Over a year ago

Well, I'm not sure what you mean. Are you suggesting there be another new column called prev_rand_row_num? Or that prev_rand_row_num be a PHP variable representing the row number previously selected? Note you can use the rownum column to select a batch of unique random rows, you don't have to get them one at a time.

bubblebath Over a year ago

I'm saying is the query SELECT * FROM products WHERE rownum > 2 LIMIT 1; an acceptable alternative as I will apply some filtering so sometimes the products won't all be needed and I will need to find the Next item In the list

|

Mike Brant · Accepted Answer · 2014-06-06 00:35:07Z

2

You might consider a pagination-like solution. Rather than ordering by RAND() (not good form a performance standpoint anyway), why not simply use LIMIT clause and randomize the offset.

For example:

SELECT product_id FROM products ORDER BY product_id LIMIT X, 1

Where X is the offset you want to use. You could easily keep track of the offsets used in the application and randomize amongst the available remaining values.

PHP code to call this might look like this:

if(!isset($_SESSION['available_offsets'] || count($_SESSION['available_offsets']) === 0) {
    $record_count = ...; // total number of records likely derived from query against table in question
    // this creates array with numerical keys matching available offsets 
    // we don't care about the values   
    $available_offsets = array_fill(0, $record_count, '');  
} else {
    $available_offsets = $_SESSION['available_offsets'];
}
// pick offset from available offsets
$offset = array_rand($available_offsets);
// unset this as available offset
unset($available_offsets[$offset]);
// set remaining offsets to session for next page load
$_SESSION['available_offsets'] = $available_offsets;

$query = 'SELECT product_id FROM products ORDER BY product_id LIMIT ' . $offset . ', 1';
// make your query now

edited Jun 6, 2014 at 0:35

answered Jun 4, 2014 at 20:36

Mike Brant

71.4k10 gold badges102 silver badges104 bronze badges

4 Comments

bubblebath Over a year ago

This would not work as each time a new product is called the query is called and so each time the list would be random with not the same order.

Mike Brant Over a year ago

@bubblebath I just noticed I did not modify the query as I had fully intended (I copied and pasted it and only change the LIMIT part). The intent of my comment was to actually remove RAND() from the query, was was not correctly shown until I edited just now.

Mike Brant Over a year ago

@JakeGould Gaps in the primary id sequence would not matter in this case as you are simply specifying an offset against the full record set. The calling app would only need to know two things - the totals number of rows (to determine range of values that can be passed in LIMIT as offset) and previously used offset values.

Mike Brant Over a year ago

Added PHP example for handling random offset. This is a simplistic example of course as it does not really address what to do if number of rows in tables changes, etc. So likely you would want some means by which you can add new rows offsets to the available offsets array if you table will be growing often, or shift offsets when rows are deleted.

bestiverse.com · Accepted Answer · 2015-05-02 12:37:15Z

0

You could try adding a seed to the RAND to avoid repeats

select * from products where product_id <> previous_product_id order by rand(7) limit 1;

From http://www.techonthenet.com/mysql/functions/rand.php :

The syntax for the RAND function in MySQL is:

RAND( [seed] ) Parameters or Arguments

seed

Optional. If specified, it will produce a repeatable sequence of random numbers each time that seed value is provided.

edited May 2, 2015 at 12:37

answered May 2, 2015 at 12:31

bestiverse.com

11 bronze badge

Comments

Giacomo1968 · Accepted Answer · 2014-06-04 20:19:08Z

First, unclear what scripting language you will use to piece this together, so will answer conceptually. I also added uppercase to your MySQL for readability. So first let’s look at this query:

SELECT * FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 1;

Superficially does what you want it to do, but if your database has thousands of items, RAND() is not recommended. It defeats all MySQL caching & is a resource hog. More details here, specially the area that reads:

A query cannot be cached if it contains any of the functions shown in the following table.

That’s just not good.

But that said you might be able to improve it by just returning the product_id:

SELECT product_id FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 1;

And then you would have to do another query for the actual product data based on the product_id, but that would be much less taxing on the server than grabbing a whole set of randomized data.

But still, RAND() inherently will bog down your system due to lack of caching. And the problem will only get worse as your database grows.

Instead I would recommend having some kind of file-based solution. That grabs a random list of items like this:

SELECT product_id FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 0,100;

You will strictly be grabbing product ids and then saving them to—let’s say—a JSON file. The logic being is as random as you want this to be, you have to be realistic. So grabbing a slice of 100 items at a time will give a user a nice selection of items.

Then you would load that file in as an array and perhaps even randomize the array & grab one item off of the top; a quick way to assure that the item won’t be chosen again. I you wish you could shorten the array with each access by re-saving the JSON file. And then when it gets down to—let’s say—less than 10 items, fetch a new batch & start again.

But in general you RAND() is a neat trick that is useful for small datasets. Anything past that should be re-factored into code where you realistically leverage what you want users to see versus what can realistically be done in a scalable manner.

EDIT: Just another thing I noticed in your code on keeping track of product IDs. If you want to go down that route—instead of saving items to a JSON file—that would be fine as well. But what you are describing is not efficient:

+--------------------------------------------------------------------------------------------+                                                                                          
|user_id |timestamp         | product_seen_string                                            |
|--------------------------------------------------------------------------------------------|
|foo     |01-01-14 12:00:00 |product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+--------------------------------------------------------------------------------------------+

A better structure would be to simply store the IDs in a MySQL table like this; let’s call it products_seen:

+----------------------------------------------------+                                                                                          
|user_id |timestamp         | product_seen_id        |
|----------------------------------------------------|
|foo     |01-01-14 12:00:00 |product_id_1            |
|foo     |01-01-14 12:00:00 |product_id_2            |
|foo     |01-01-14 12:00:00 |product_id_3            |
+----------------------------------------------------+

And then have a query like this

SELECT * FROM products, products_seen WHERE user_id = products_seen.user_id product_id <> product_seen_id ORDER BY RAND() LIMIT 1;

Basically, you would be cross referencing products_seen with products and making sure the products already seen by user_id are not shown again.

This is all pseudo-code, so please adjust to fit the concept.

Collectives™ on Stack Overflow

Getting random data from a MySQL database but not repeating data

4 Answers 4

6 Comments

4 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

6 Comments

4 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related