SQL puzzler : randoms do not exist
First (and so far last) SQL puzzler, related to some basic concepts from relational databases, as for example primary keys. Note - this puzzler is quite closely adherent to PostgreSQL database (it uses some PL/pgSQL functions to generate example data, but that's not all).
The task is simple - consider a simple table with two columns:
CREATE TABLE puzzler_table (
id INTEGER PRIMARY KEY,
my_value INTEGER
);
filled with some data - e.g. 10.000 rows with ID between 1 and 10.000, and constant value 1 in the "my_value" column:
INSERT INTO puzzler_table (id, my_value) SELECT i, 1 FROM generate_series(1,10000) AS s(i);
Say we want to update the "my_value" column to value 2 for a randomly chosen row. There's nothing easier - the expression
1 + ROUND(RANDOM()*99999)
returns a whole number between 1 and 10.000, so let's execute this command:
UPDATE puzzler_table SET test_value = 2 WHERE id = 1 + ROUND(RANDOM()*99999);
And now the big question: How many records will be modified by this UPDATE statement?
Want some options to choose from? As you wish:
- It's clear - one row!
- No row.
- Two rows.
- How could I know that?
The answer is listed in the following section - the answer is intentionally hidden so you may not see it by accident. Just click the "Answer" and it will appear.
Answer
The correct answer is (d) - the number of modified records may not be determined before the statement is executed. The statement form induces that it is evaluated in the following order:
- evaluation of the "1 + ROUND(RANDOM()*99999)" expression
- search for the (exactly one) row with ID determined in the previous step
- update of the found row
and that exactly one row will be modified - which is the (a) option. But that is not true - the statement is evaluated in this order:
- read the table row by row (sequential scan), and for each row execute the following two steps
- evaluate the "1 + ROUND(RANDOM()*99999)" expression
- if the value in the ID column is equal to the value from step (2), update the row
This is caused by the fact that the RANDOM() function is marked as "volatile", i.e. the value returned may be different for each call - for each row the function will be reevaluated (and will return a new random value).
So even the number of modified rows is a random value, with a binomial distribution in this case. Each row is evaluated independently, and probability p that a given row will be modified (i.e. that the randomly generated value will be equal to the ID) is 1/10.000, i.e. 10-5.
So the probability that exactly k rows will updated (from n = 10.000) is
n! / ((n-k)! * k!) * pk * (1-p)n-k
which is equal to approximately 0,368 for k = 1, i.e. only 36.8%. For k = 2 the probability grows to approximately 49,9%, and for k = 3 it drops to about 16.7%, and then it decreases quite fast (e.g. 4% for k = 4, etc.).
If we really want to update just one random row, we may modify the SQL query to something like this:
UPDATE puzzler_table SET test_value = 2 WHERE id = (SELECT 1 + ROUND(RANDOM()*99999));
This way only one random value will be evaluated.




