Dev Articles Community Forums » Databases » Database Development
  #1
June 21st, 2004, 11:17 AM
Kostko (Registered User)

Weighted random selection in PostgreSQL

I would like to implement a weighted random selection, so that rows with larger k would be selected more often, but the selection would still be random.

What would be the best way to implement it?

  #2
June 21st, 2004, 01:23 PM
dhouston (Contributing User, Tennessee)
If there's weighting, it's not really random, is it? The only way I can think of offhand to do something like this is to break the larger k into individual units. So in a sample of three items with weights 5, 3, and 2, you'd have ten rather than three items in your random selection: 5 for the item with k 5, 3 for k 3, and 2 for k 2. The greater frequency of the entries for k 5 increases the likelihood that that item is selected. I can't suggest how to do that in a single SELECT without either storing duplicate rows (instead of a weight field k) or adding some programming logic.
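As a sketch (not from the post), the expansion idea can be checked in a few lines of Python; the item names here are made up for illustration:

```python
import random

# Rows with weights 5, 3 and 2, as in the example above.
items = [("A", 5), ("B", 3), ("C", 2)]

# Expand each item into `weight` individual units: ten candidates
# instead of three. A uniform pick over the expanded list is then
# proportional to the weights.
expanded = [name for name, weight in items for _ in range(weight)]

rng = random.Random(1)
counts = {name: 0 for name, _ in items}
for _ in range(10_000):
    counts[rng.choice(expanded)] += 1
# "A" ends up with roughly half the picks, "C" roughly a fifth.
```

In more recent PostgreSQL versions the same expansion could be done at query time, for example by joining against generate_series(1, weight), so the duplicate rows never need to be stored.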
__________________
Please don't PM me asking for solutions outside the scope of a thread.
Keeping all responses in a thread stands to help others who come along later,
which is after all what this forum's all about.

  #3
July 24th, 2004, 05:24 PM
jschmitt (Registered User)
There are plenty of places where a weighted random selection is useful (no need to list them here).

Duplicating entries can break table structure (relationships and unique keys); the query below avoids that by using a single "weight" field instead:

SELECT [FieldsList]
FROM [TableName]
WHERE [WhereStatements]
ORDER BY random() * (1 / Weight) LIMIT 1;

This assumes your weight field is an integer named "Weight", where a larger value means more weight. (random() is PostgreSQL's equivalent of MySQL's Rand().)
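As a hedged sketch (not from the post), the query's behavior can be simulated outside the database: every row draws its own uniform number, scales it by 1/Weight, and the smallest result is the row that LIMIT 1 returns:

```python
import random

def pick(rows, rng):
    # Mimics ORDER BY random() * (1 / Weight) ASC ... LIMIT 1:
    # each row is a (name, weight) pair, and the row with the
    # smallest scaled draw wins.
    return min(rows, key=lambda row: rng.random() * (1 / row[1]))[0]

rows = [("light", 1), ("heavy", 3)]
rng = random.Random(7)
counts = {"light": 0, "heavy": 0}
for _ in range(100_000):
    counts[pick(rows, rng)] += 1
# The heavy row wins most of the time, yet the light row still
# appears whenever its draw happens to be small enough.
```

This also speaks to the earlier objection: a lower-weighted row is returned any time its random draw is small enough to beat the scaling.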

  #4
July 26th, 2004, 08:14 AM
dhouston (Contributing User)
I don't see that this ever allows a lower-weighted result to be returned. For a given random number, rows with a higher weight will always evaluate to a lower number than rows with a lower weight and thus will always be returned (because the default is to order ascending). My solution, because it relies on frequency of a given weight, has an increased likelihood of returning a higher-weighted row but does allow lower-weighted rows to be returned. You can get around the table structure issues by having a separate weights table tied to the main table. Maybe I'm missing something in your solution. Can you show me a situation in which a lower-weighted row will ever be returned? If not, then it's a flawed solution.

  #5
July 26th, 2004, 08:15 AM
dhouston (Contributing User)
I'm interested in seeing a more elegant solution than mine, incidentally. If you can show me the situation I've requested, I'll gladly admit that I'm wrong and add your solution to my toolbox. I just don't see it as a valid solution yet.

  #6
July 29th, 2004, 11:22 AM
jschmitt (Registered User)
Running it is all that's needed to see a sample:

Sample data:

ID  Fruit       Weight
1   oranges     1
2   apples      3
3   strawberry  2
4   pineapple   1
5   cherry      3
6   peach       2

Results:

ID  Fruit       run1 (1,000)   run2 (1,000)   run3 (100,000)
1   oranges     79  (7.9%)     66  (6.6%)     7345  (7.345%)
2   apples      268 (26.8%)    282 (28.2%)    26965 (26.965%)
3   strawberry  153 (15.3%)    150 (15.0%)    15377 (15.377%)
4   pineapple   77  (7.7%)     73  (7.3%)     7851  (7.851%)
5   cherry      274 (27.4%)    276 (27.6%)    26418 (26.418%)
6   peach       149 (14.9%)    153 (15.3%)    16044 (16.044%)

What you need to remember is that the random number is always between 0 and 1. It gets multiplied by the inverse of the weight; the weighting works because a higher weight yields a smaller inverse, producing a smaller sort key (the query sorts ascending by default). The weight is a constant for that row, so it skews the results rather than overriding them.

Technically random()/Weight would be a hair faster (one less operation), but the inverse was written out for the example: if your weights ran the other way, with a lower number meaning higher precedence, you would drop the inverse like so:

SELECT [FieldsList]
FROM [TableName]
WHERE [WhereStatements]
ORDER BY random() * Weight LIMIT 1;

This all depends on user-entered values. There are many variations you could work in: normalize the weights (say, for dated material, with TO_DAYS()/DateDiff()/Age()), or apply cos()/sin()/ln()... but that doesn't really belong here.


  #7
July 29th, 2004, 11:40 AM
dhouston (Contributing User)
Gotcha. I had been thinking in terms of RAND() returning whole numbers rather than decimals between 0 and 1, which obviously changes things a little. Thanks for the good followup. Hope you'll stick around and continue to shed light on the various topics that interest you.

  #8
July 31st, 2004, 12:04 PM
jschmitt (Registered User)
Glad to help; I'll definitely try to in the future. This was one of those things I'd been searching for, a way to offload processing time to my DB server rather than my code, and it seemed there were many more questions about it than answers.

It's running well in a few places, and I have a few more to convert in a similar fashion with cos() and ln() functions for weighting.

  #9
May 1st, 2011, 08:09 PM
walkoffhomerun (Registered User)
Although this is an old thread, it's still very relevant; a Google search for weighted random samples shows how popular the topic is.

There seem to be two methods out there. One is the method jschmitt provided in this thread, which places the random-number-times-weight expression in the ORDER BY clause. That incurs a resource hit, since you're doing a calculation in the ORDER BY, but otherwise produces valid results.

The second option is more straightforward for SQL Server users: it uses the NewID() function, and the calculation is made outside of the ORDER BY clause. Something along these lines:
SELECT Name, Points,
       RAND(CAST(NEWID() AS VARBINARY)) * Points AS Weight
FROM TableName
ORDER BY Weight DESC
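The NewID() cast is just a trick to seed a fresh random number per row; abstracting that away, this variant multiplies by the weight and keeps the largest value, where the first query divided and kept the smallest. A quick Python sketch (hypothetical rows, not from the thread) suggests the two orderings select at the same rates:

```python
import random

def pick_div_asc(rows, rng):
    # ORDER BY random() * (1 / Weight) ASC ... LIMIT 1
    return min(rows, key=lambda r: rng.random() / r[1])[0]

def pick_mul_desc(rows, rng):
    # ORDER BY random() * Weight DESC, keep the top row
    return max(rows, key=lambda r: rng.random() * r[1])[0]

rows = [("light", 1), ("heavy", 3)]
n = 100_000
rng = random.Random(42)
freq_div = sum(pick_div_asc(rows, rng) == "heavy" for _ in range(n)) / n
rng = random.Random(42)
freq_mul = sum(pick_mul_desc(rows, rng) == "heavy" for _ in range(n)) / n
# freq_div and freq_mul come out nearly identical.
```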

But whichever way you choose, NOBODY seems to explain what a weighted random sample should actually look like.

In his example, jschmitt has six fruits. To determine what the end results of the weighting SHOULD be, you need to add up the weights of all six: 1, 1, 2, 2, 3, 3, for a total of 12.

To get the proper weighted distribution, divide each fruit's weight by that total. So every fruit with a weight of 1 has an expected share of 8.3% (1/12), every fruit with a weight of 2 has 16.7% (2/12), and every fruit with a weight of 3 has 25% (3/12).
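That arithmetic, written out against jschmitt's sample data:

```python
# Expected share of each fruit: its weight divided by the sum of
# all the weights in the table.
weights = {"oranges": 1, "apples": 3, "strawberry": 2,
           "pineapple": 1, "cherry": 3, "peach": 2}
total = sum(weights.values())                       # 12
expected = {fruit: w / total for fruit, w in weights.items()}
# weight 1 -> 1/12 (about 8.3%), weight 2 -> 2/12 (about 16.7%),
# weight 3 -> 3/12 (exactly 25%)
```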

So whenever anyone evaluates a weighted-random-sample formula, you need a way to verify that the formula actually works.

By that standard, jschmitt's numbers are pretty close to where they should be, EXCEPT that the higher-weighted fruits have a larger percentage error, though not by much. In his example the weights are small, so the error barely shows; but if your weights range from small to very large, say some items with a weight of 1 and some with a weight of 100, your numbers will end up badly skewed. There is a reason for this, explained below.

Oranges and pineapples have a weight of 1, so they SHOULD land around 8.3%; his test results show a three-run average of 7.28% for oranges and 7.62% for pineapples.

Strawberries and peaches have a weight of 2 and SHOULD land around 16.7%; the three-run average is 15.23% for strawberries and 15.42% for peaches.

Apples and cherries have a weight of 3 and SHOULD land around 25%; the three-run average is 27.3% for apples and 27.14% for cherries.

Notice that the fruits with a weight of 3 are FARTHER from their target of 25%.

The reason is in the calculation itself. The formula simply multiplies a uniform random number by the weight. To be truly weighted, each item's share of the output must equal its weight divided by the sum of all the weights; that is what you should get from the database on a consistent basis, and it's what you (or your customer) are expecting. With this formula, though, the shares drift when some weights are very large and others very small. Imagine fruits weighted 99 or 100 alongside fruits weighted 1 or 2: the heavy fruits would get TOO MANY selections, well above their calculated share. If something is calculated ahead of time to receive 17% of the output, it should stay within about half a percentage point, 16.5% to 17.5%, and the larger the table and the longer the run, the tighter that range should get. With large gaps between weights, jschmitt's formula will run the heavy items 2-5% higher than they should be, which pushes the light items correspondingly lower. On a chart you'd see a tilted curve: the middle weights roughly accurate, the low weights under-selected, and the high weights over-selected.

It would be MUCH better to keep the weights within a small range of numbers to keep this percentage error down.

The second formula listed earlier, RAND(CAST(NEWID() AS VARBINARY)) * Points for SQL Server users (where Points is a field holding the weights), might avoid this problem.

At least jschmitt actually ran the numbers and posted them to test the idea. The true test would be to run his formula against a wide gap of weights: a real, working formula would not be thrown off by a wide range from high to low. I'd like to see jschmitt's formula run with weights from, say, 1 to 100 to see where the percentage errors land on the larger numbers.
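As a rough check of that prediction (not from the thread), here is a small simulation of the random()/weight ordering with weights 1 and 100. A truly proportional scheme would give the heavy item 100/101, about 99.0% of the picks; this ordering gives it noticeably more:

```python
import random

def pick(rows, rng):
    # ORDER BY random() * (1 / Weight) ASC ... LIMIT 1
    return min(rows, key=lambda r: rng.random() / r[1])[0]

rows = [("light", 1), ("heavy", 100)]
rng = random.Random(0)
n = 200_000
heavy_share = sum(pick(rows, rng) == "heavy" for _ in range(n)) / n
proportional = 100 / 101
# heavy_share lands around 99.5%, above the proportional 99.0%,
# consistent with very high weights being over-selected.
```

For what it's worth, ordering by random() ^ (1.0 / Weight) descending is a known way to make the selection exactly proportional to the weights, though that goes beyond what this thread tried.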

Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2017, Jelsoft Enterprises Ltd.
© 2003-2017 by Developer Shed. All rights reserved.