


Dev Articles Community Forums
> Databases
> Database Development

Weighted random selection in PostgreSQL








June 21st, 2004, 10:17 AM

Registered User


Join Date: Apr 2003
Posts: 1


Weighted random selection in PostgreSQL
I would like to implement a weighted random selection, so that rows with a larger k would be selected more often, but the selection would still be random.
What would be the best way to implement it?

June 21st, 2004, 12:23 PM


Contributing User


Join Date: May 2003
Location: Tennessee
Posts: 1,355


If there's weighting, it's not really random, is it? The only way I can think of offhand to do something like this is to break the larger k into individual units. So in a sample of three items with weights 5, 3, and 2, you'd have ten rather than three items in your random selection: 5 for the item with k = 5, 3 for k = 3, and 2 for k = 2. The greater frequency of the entries for the k = 5 item would increase the likelihood that that item would be selected. How to do that in a single SELECT without either storing duplicate rows (rather than a weight field k) or doing some programming logic, I can't suggest.
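That duplicate-units idea can be sketched outside the database. This is a minimal Python simulation, not PostgreSQL itself; the item names and the 5/3/2 weights are just the example values from above:

```python
import random
from collections import Counter

# Three "rows", each with a weight k, as in the 5/3/2 example above.
rows = [("A", 5), ("B", 3), ("C", 2)]

# Expand each row into k individual units: 10 pool entries for 3 rows,
# so a uniform pick over the pool is a weighted pick over the rows.
pool = [item for item, k in rows for _ in range(k)]

random.seed(1)
counts = Counter(random.choice(pool) for _ in range(10_000))
for item, k in rows:
    print(item, counts[item] / 10_000)  # should land near k / 10
```

In SQL terms this corresponds to actually storing the duplicate rows; a separate weights table tied to the main table avoids the duplication at the cost of the expansion step.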
__________________
Please don't PM me asking for solutions outside the scope of a thread.
Keeping all responses in a thread stands to help others who come along later,
which is after all what this forum's all about.

July 24th, 2004, 04:24 PM

Registered User


Join Date: Jul 2004
Posts: 3


there are plenty of places where a weighted random selection is useful (those don't need to be listed here)...
duplicating entries can break table structure (relationships and unique keys); this approach keeps one row per item and uses a "weighting" field instead:
SELECT [FieldsList]
FROM [TableName]
WHERE [WhereStatements]
ORDER BY random() * (1.0 / Weight) LIMIT 1;
This assumes your weight field is an integer named "Weight" and that a larger value means more weight. (PostgreSQL's function is random(), not MySQL's RAND(); note the 1.0, since 1 / Weight on an integer column would be integer division.)

July 26th, 2004, 07:14 AM


Contributing User


Join Date: May 2003
Location: Tennessee
Posts: 1,355


I don't see that this ever allows a lower-weighted result to be returned. For a given random number, rows with a higher weight will always evaluate to a lower number than rows with a lower weight, and thus will always be returned (because the default is to order ascending). My solution, because it relies on frequency of a given weight, has an increased likelihood of returning a higher-weighted row but does allow lower-weighted rows to be returned. You can get around the table structure issues by having a separate weights table tied to the main table. Maybe I'm missing something in your solution. Can you show me a situation in which a lower-weighted row will ever be returned? If not, then it's a flawed solution.

July 26th, 2004, 07:15 AM


Contributing User


Join Date: May 2003
Location: Tennessee
Posts: 1,355


I'm interested in seeing a more elegant solution than mine, incidentally. If you can show me the situation I've requested, I'll gladly admit that I'm wrong and add your solution to my toolbox. I just don't see it as a valid solution yet.

July 29th, 2004, 10:22 AM

Registered User


Join Date: Jul 2004
Posts: 3


running it is all that is needed to see a sample:

sample data:

ID  Fruit       Weight
1   oranges     1
2   apples      3
3   strawberry  2
4   pineapple   1
5   cherry      3
6   peach       2

results:

ID  Fruit       run1 (1,000)    run2 (1,000)    run3 (100,000)
1   oranges     79  (7.9%)      66  (6.6%)      7345  (7.345%)
2   apples      268 (26.8%)     282 (28.2%)     26965 (26.965%)
3   strawberry  153 (15.3%)     150 (15.0%)     15377 (15.377%)
4   pineapple   77  (7.7%)      73  (7.3%)      7851  (7.851%)
5   cherry      274 (27.4%)     276 (27.6%)     26418 (26.418%)
6   peach       149 (14.9%)     153 (15.3%)     16044 (16.044%)

what you need to remember is that the random number will always be between 0 and 1. this gets multiplied by the inverse of the weight... the weighting works because a higher weight yields a smaller inverse, helping to produce a smaller sort value (the query defaults to selecting in ASC order). the weight is a constant for that line item: it skews the results, it doesn't kill them.
technically it would be a hair faster to use random() / Weight (one less operation), but the inverse was kept for the example: if your weights worked the other way, with a lower number having higher precedence, you would drop the inverse like so:
SELECT [FieldsList]
FROM [TableName]
WHERE [WhereStatements]
ORDER BY random() * Weight LIMIT 1;
this is all dependent on user-entered values. there are many variations you could play into this: normalize the weights (say, with dated material and TO_DAYS()/DateDiff()/age()), or a cos()/sin()/ln()... but that doesn't really belong here
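The query above amounts to giving every row its own sort key random() / Weight and returning the row with the smallest key. A hedged Python simulation of that logic (the fruit data is copied from the sample above; pick() is a made-up helper, not a database function) reproduces the shape of the posted percentages, including the fact that lower-weighted rows do get returned:

```python
import random
from collections import Counter

# Fruit -> Weight, copied from the sample data above.
fruits = {"oranges": 1, "apples": 3, "strawberry": 2,
          "pineapple": 1, "cherry": 3, "peach": 2}

def pick(weights):
    # ORDER BY random() / Weight LIMIT 1, done by hand: each row
    # draws its own uniform number and the smallest key wins.
    return min(weights, key=lambda f: random.random() / weights[f])

random.seed(42)
trials = 100_000
counts = Counter(pick(fruits) for _ in range(trials))
for fruit in fruits:
    print(fruit, round(100 * counts[fruit] / trials, 3), "%")
```

Run this way, the weight-3 rows come out near 27% and the weight-1 rows near 7%, matching the run3 column above.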

July 29th, 2004, 10:40 AM


Contributing User


Join Date: May 2003
Location: Tennessee
Posts: 1,355


Gotcha. I had been thinking in terms of RAND() returning whole numbers rather than decimals between 0 and 1, which obviously changes things a little. Thanks for the good follow-up. Hope you'll stick around and continue to shed light on the various topics that interest you.

July 31st, 2004, 11:04 AM

Registered User


Join Date: Jul 2004
Posts: 3


glad to help, i'll definitely try in the future. this was one of those things i've been searching for... a way to offload processing time to my DB server rather than my code, and it seemed there were many more questions about it than answers.
it's running well in a few places; i have a few more places to convert in a similar fashion with cos() and ln() functions for weighting.

May 1st, 2011, 07:09 PM

Registered User


Join Date: May 2011
Posts: 1


although this is an old thread, it is still very relevant; a Google search for weighted random samples shows it is a very popular topic.
there seem to be 2 methods out there. one is the method used in this thread, provided by jschmitt, which puts the Random() * Weight calculation in the ORDER BY clause. That creates a resource hit, since you are doing a calculation in the ORDER BY, but otherwise produces valid results.
The 2nd option is more straightforward for SQL Server users, using the NEWID() function, with the calculation made outside the ORDER BY clause. Something along these lines:
SELECT Name, Points, RAND(CAST(NEWID() AS VARBINARY)) * Points AS Weight
FROM TableName
ORDER BY Weight DESC
But whatever way you choose, NOBODY seems to explain what a weighted random sample SHOULD look like.
For this thread, jschmitt has 6 fruits in his example. To determine what the end results of the weighting SHOULD be, you need to add up the weights of all 6 fruits: 1, 1, 2, 2, 3, 3, for a total of 12.
Now, to determine the proper weighted distribution, divide each fruit's weight by the sum of all the weights. Every fruit with a weight of "1" has an expected share of 8.3% (1/12), every fruit with a weight of "2" has an expected share of 16.7% (2/12), and every fruit with a weight of "3" has an expected share of 25% (3/12).
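That expected-share arithmetic is simply weight / sum(weights); a quick sketch using the thread's fruit weights:

```python
# Weights from the fruit example earlier in the thread.
weights = {"oranges": 1, "apples": 3, "strawberry": 2,
           "pineapple": 1, "cherry": 3, "peach": 2}

total = sum(weights.values())                        # 1+1+2+2+3+3 = 12
shares = {f: w / total for f, w in weights.items()}  # expected share

print(round(100 * shares["oranges"], 1))     # 8.3  (1/12)
print(round(100 * shares["strawberry"], 1))  # 16.7 (2/12)
print(round(100 * shares["apples"], 1))      # 25.0 (3/12)
```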
So, when anyone evaluates the performance of a weighted random sampling formula, you must be able to verify the formula against those expected shares.
In jschmitt's example the numbers are pretty close to where they should be, EXCEPT that the higher-weighted fruits have a larger % error, although not by much. For most people this error might not matter, BUT in his example the weights are small. If your weights range from small to really large, your numbers will be badly skewed: with some items at a weight of "1" and some at "100", the results will be far off in the end. There is a reason, so read below.
Oranges and pineapples have a weight of "1", so they SHOULD be around 8.3%; his 3-run averages are 7.28% for oranges and 7.62% for pineapples.
Strawberries and peaches have a weight of "2" and SHOULD be around 16.7%; his 3-run averages are 15.23% for strawberries and 15.42% for peaches.
Apples and cherries have a weight of "3" and SHOULD be around 25%; his 3-run averages are 27.3% for apples and 27.14% for cherries.
You will notice the fruits with a weight of "3" land FARTHER from their expected 25%.
The reason is in the calculation used in this example. The formula simply takes a uniform random number and multiplies it by the weight (or its inverse). To be truly weighted, each item's probability of selection should equal its own weight divided by the sum of all the weights; that is what you are trying to get out of the database on a consistent basis, and what you or your customer is expecting.
With this formula, though, the numbers drift from where they are supposed to be whenever some weights are really large while others are really low. Imagine fruits weighted as high as 99 or 100 alongside fruits at 1 and 2: the difference would be huge, and the high-weighted fruits would get TOO MANY outputs, higher than their calculated share. If something is calculated ahead of time to get 17% of the outputs, it should land within about +/- 0.5%, i.e. 16.5% to 17.5%, and the larger the table and the longer the runs, the more accurate it should be. With large gaps between the weights, this formula returns the heavily weighted rows noticeably more often than it should, which in turn starves the lower weights. You would end up with an angled curve on a chart: the middle weights fairly accurate, the low weights getting lower output percentages than they should, and the high weights getting higher.
It would be MUCH better to keep the weights within a small range of numbers to keep this % error down.
The 2nd formula listed earlier might avoid this problem, using "RAND(CAST(NEWID() AS VARBINARY)) * Points" for SQL Server users, where Points is a field of entered weights.
At least jschmitt actually ran the numbers and posted them to test the formula; the true test would be to run it against a large gap of weights. A truly correct formula would not be affected by a wide range from high to low weights. I would like to see jschmitt's formula run with weights from, say, 1 to 100, to see where the % errors land on the larger numbers.
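The 1-to-100 experiment requested above is easy to simulate. This is a hedged Python sketch, not a database run: two rows weighted 1 and 100, where the proportional share for the heavy row would be 100/101, about 99.01%. The random()/weight sort key gives it roughly 99.5% instead, over-representing it as predicted. For comparison, a known exact alternative (not from this thread) uses exponential sort keys, -ln(random())/weight, which makes each row's win probability exactly proportional to its weight:

```python
import math
import random

random.seed(7)
weights = {"light": 1, "heavy": 100}
trials = 100_000

naive_wins = exp_wins = 0
for _ in range(trials):
    # ORDER BY random() / Weight LIMIT 1, done by hand.
    naive = {f: random.random() / w for f, w in weights.items()}
    if min(naive, key=naive.get) == "heavy":
        naive_wins += 1
    # Exact variant: exponential keys; 1 - random() avoids log(0).
    exact = {f: -math.log(1.0 - random.random()) / w
             for f, w in weights.items()}
    if min(exact, key=exact.get) == "heavy":
        exp_wins += 1

print("proportional share:", 100 / 101)            # ~0.9901
print("random()/weight:   ", naive_wins / trials)  # ~0.995, too high
print("exponential keys:  ", exp_wins / trials)    # ~0.990, on target
```

The exponential-key variant translates directly back to SQL as ORDER BY -ln(random()) / Weight LIMIT 1, under the same assumption of a positive integer "Weight" column.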

