Generate random numbers without repetitions

  • I want to generate a list of N different random numbers:



    private static readonly Random random = new Random();

    public static List<int> GetRandomNumbers(int count)
    {
        List<int> randomNumbers = new List<int>();

        for (int i = 0; i < count; i++)
        {
            int number;

            do number = random.Next();
            while (randomNumbers.Contains(number));

            randomNumbers.Add(number);
        }

        return randomNumbers;
    }


    But I feel there is a better way; the do...while loop in particular makes this ugly. Any suggestions on how to improve this?


    Isn't this a case for the Fisher–Yates shuffle?

    @MartinR Not exactly. The Fisher–Yates shuffle gives you a permutation of **all** elements in a given range.

    You are right, I did not read it correctly.

    @MartinR However, if you perform this shuffle on a list of distinct numbers and then trim it to the expected size, you have yet another approach.

    What's the use case here? What are you actually trying to do? There might be a domain specific application that's better than something generic (and bad).

    Have you heard of linear congruential generators? It could (with the right parameters) allow you to select in a pseudo-random order n distinct numbers between 1 and m, for some m >= n.

    Unfortunately I'm not inspired enough for a full answer, but someone might expand on this comment. The sequence X(n) = ( a*X(n-1) + b ) % m will span all numbers between 0 and m-1 in a pseudo-random order with the right choice of parameters a, b, (and maybe some restrictions on m). Simply run the sequence n times for the n numbers between 0 and m-1 that you need. The standard choice of parameters (used in a lot of libraries) is a = 16807, b = 0 and m = 2^(31) - 1.
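    A minimal sketch of that idea in C# (illustrative only; the seed 12345 is arbitrary, and the full-period claim relies on a = 16807 being a primitive root modulo the prime m = 2^31 - 1):

    using System;

    public class LcgDemo
    {
        public static void Main()
        {
            // Lehmer / MINSTD parameters from the comment above:
            // x(n) = (a * x(n-1)) % m, with a = 16807, b = 0, m = 2^31 - 1.
            // Because m is prime and a is a primitive root mod m, the sequence
            // visits every value in [1, m-1] exactly once before repeating.
            const long a = 16807;
            const long m = 2147483647; // 2^31 - 1

            long state = 12345;        // any seed in [1, m-1]; 12345 is arbitrary
            for (int i = 0; i < 10; i++)
            {
                state = (a * state) % m; // 64-bit arithmetic avoids overflow
                Console.WriteLine(state); // 10 distinct pseudo-random values
            }
        }
    }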

    @BrunoCosta: "so if the numbers don't repeat you can't actually call it random anymore, it will have the property of not repeating which is a very unique property": That's not true. "Random" does not imply "independent", let alone "not having any interesting properties". Heck, by your definition of "random", we couldn't even say "random integer", because the property of being an integer would mean it wasn't random . . .

    This post has been auto-flagged by the system for having too many comments. Please continue all discussions in The 2nd Monitor chat room where this question has been discussed a lot already.

    Bob Floyd invented an algorithm for this purpose. See: http://stackoverflow.com/a/2394292/179910. It's substantially better than what you've shown *or* any of the answers currently posted.

    A little suggestion: use the yield keyword to avoid creating a list. http://stackoverflow.com/questions/39476/what-is-the-yield-keyword-used-for-in-c
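    A minimal sketch of that suggestion applied here (the method name is illustrative, and it swaps the OP's List for a HashSet, which is still needed to remember what has already been yielded):

    using System;
    using System.Collections.Generic;

    public class YieldDemo
    {
        static Random random = new Random();

        // Lazily yields distinct random values instead of building a List first.
        public static IEnumerable<int> RandomNumbers(int count)
        {
            HashSet<int> seen = new HashSet<int>();
            while (seen.Count < count)
            {
                int number = random.Next();
                if (seen.Add(number))   // Add returns false on a duplicate
                {
                    yield return number;
                }
            }
        }

        public static void Main()
        {
            foreach (int n in RandomNumbers(5))
            {
                Console.WriteLine(n);
            }
        }
    }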

    @Kao - it would be great if we could discuss this question (and bounty) in chat...

    @JerryCoffin - I looked into it; the algorithm is good for handling collisions efficiently, but it has limitations in terms of the quality of the randomness of the results.

    @rolfl: I think you've misunderstood its intent. If you follow the links in the comments, you'll find a couple of mathematical proofs of the quality of randomness of the results.

    @JerryCoffin , I believe I am correct, and that, for example, the first value in the result can never be `N`, which means that the randomness of the result is not maintained. The randomness of the selection is maintained, but the order is not random (enough). Can we clear this up in the 2nd Monitor?

    @JerryCoffin User Kache **does not prove randomness of the results**. It only proves **that every number has the same probability of appearing in the result**. However, the last rangeLength - count elements **cannot** appear at index 0 of the result. Also, as I have already mentioned too many times, as the number of requested elements gets closer to the range length, the result becomes closer to a linear sequence where x(n) = x(n-1) + 1. So I'm with rolfl on this one.

    "Generate random numbers without repetitions" - then it's not random.

    @radarbob I had that discussion with ruahk already. Randomness doesn't have to be independent, which means a non-repeating sequence can still be random. For example: let x1, y1 and y2 be random, with x2 = x1 + y1 and x3 = x2 + y2. Then x1, x2, x3 would be a non-repeating random sequence. The kind of randomness you are talking about is independent randomness.

    But the thing about randomness is you can never tell.

    @radarbob please continue the discussion in chat. (And, you are mistaken.)

    @Kao It seems you haven't chosen an answer yet. Should you start thinking about which answer to accept?

    @BrunoCosta I will choose an answer tomorrow, after the bounty ends.

    It seems this guy solved your problem in a very nice way. This is what he says in the first line of the post: *In this post I’m going to show a way to make an iterator that will visit items in a list in a random order, only visit each item once, and tell you when it’s visited all items and is finished. It does this without storing a shuffled list, and it also doesn’t have to keep track of which items it has already visited.*

    Did anybody else visit this question because they saw it in the tour and wanted to see if it was real? I sure did

  • rolfl (correct answer, 7 years ago)

    Updated answer in response to bounty: see Is that your final answer? at the end, and other changes; basically, the answer has been significantly rewritten.







    To break your problem down into requirements:




    • you need a set of random numbers

    • the numbers need to be unique

    • the order of the returned numbers needs to be random



    Your current code indicates that the range of random numbers is specified by Random.Next(), which returns values in the [0 .. Int32.MaxValue) range (note, it excludes Int32.MaxValue). This is significant for the purpose of this question, because other answers have assumed that the range is configurable, and 'small'.



    If the range should be configurable, then the recommended algorithm would potentially be much larger.



    Based on those assumptions, let's do a code review...



    Code Style



    do ... while



    The most glaring problem here is the un-braced do-while loop. You already knew it, but this code is ugly:




    do number = random.Next();
    while (randomNumbers.Contains(number));



    It really should be braced:



    do
    {
        number = random.Next();
    } while (randomNumbers.Contains(number));


    This makes the statement clear, and significantly reduces confusion. Always use braces for 1-liners.



    List Construction



    The List class allows an initial capacity to be specified. Since the capacity only ever needs to be count, it makes sense to initialize the list with that capacity:



    List<int> randomNumbers = new List<int>(count);


    Current Algorithm



    This is where the most interesting observations can be made. Let's analyze your current algorithm:




    • Create a container for the results

    • repeat until you have selected N values:

      • Select a random value

      • check if it has been previously selected

      • if it is 'new', then add it to the container




    This algorithm will produce random values, in a random order, and with good random characteristics (no skews, biases, gaps, etc.).



    In other words, your results are good.



    The problem is with performance....



    There are two performance concerns here, one small, the other large:




    1. the do-while loop to avoid collisions

    2. the List container



    do-while performance



    The do-while has a very low impact on performance... almost negligible. This is hotly debated, but the reality is that you would need a very, very large count before this becomes a problem. The reasoning is as follows:



    Collisions happen when the random value was previously selected. For the specified range of [0 .. Int32.MaxValue), you would need a very large count before collisions actually happened. For example, count would have to reach roughly 55,000 before there was better than a 50% chance of even a single collision.



    In general terms: given a range of size \$N\$ from which \$M\$ numbers are selected, if \$M < \sqrt{N}\$ then the probability of even a single collision is below 50%. Since the range here is very large, that probability is small.
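    For reference, this is the standard birthday-problem estimate: with \$M\$ draws from a range of size \$N\$, the probability of at least one collision is approximately

    \$\$p \approx 1 - e^{-M^2 / (2N)},\$\$

    and solving \$p = \frac{1}{2}\$ gives \$M \approx \sqrt{2N \ln 2} \approx 1.18\sqrt{N}\$, which is where the \$\sqrt{N}\$ rule of thumb (and the roughly 55,000 figure above, for \$N = 2^{31}\$) comes from.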



    Obviously, if the range was small, then the probabilities would be significantly affected. But the range is fixed at Int32.MaxValue, so that's OK.



    Additionally, if the count were large, the probabilities would also be affected. How large is very large? You would be running into very large arrays before you ran into significant problems; I would venture that count would be close to \$\frac{Int32.MaxValue}{2}\$ before you hit a significant performance issue.



    List performance



    This is without doubt your largest concern. You use the randomNumbers.Contains(number) call to determine whether a value was previously selected. This requires a scan of all previously selected values. As mentioned, the check will almost always return false, which means the entire list has to be scanned every time.



    As the count value increases, the total time spent in Contains grows at a quadratic rate, \$O(n^2)\$, where n is count.



    This performance problem will become critical much sooner than the random-collision problem.
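    To make the difference concrete, a rough timing sketch along these lines can be used (the count of 50,000 is arbitrary, and this is an illustration rather than a rigorous benchmark):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    public class ContainsTiming
    {
        public static void Main()
        {
            Random random = new Random();
            const int count = 50000; // arbitrary, just big enough to show the gap

            Stopwatch sw = Stopwatch.StartNew();
            List<int> list = new List<int>(count);
            while (list.Count < count)
            {
                int n = random.Next();
                if (!list.Contains(n)) // scans up to list.Count items every time
                {
                    list.Add(n);
                }
            }
            Console.WriteLine("List:    " + sw.ElapsedMilliseconds + " ms");

            sw = Stopwatch.StartNew();
            HashSet<int> set = new HashSet<int>();
            while (set.Count < count)
            {
                set.Add(random.Next()); // constant-time membership check
            }
            Console.WriteLine("HashSet: " + sw.ElapsedMilliseconds + " ms");
        }
    }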



    Putting it together



    The problem with your code is that you are trying to do too much at once: you are using a List because that is your return type, when a HashSet would be better. If you break the problem down into stages, you can solve things more elegantly.



    If you add a duplicate value to a HashSet, it does not grow, and the operation's performance does not depend on the amount of data already in the HashSet (it is \$O(1)\$). You can use the HashSet's Count to track how many unique values you have.
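    For example, Add itself reports whether the value was actually inserted, so no separate Contains check is needed:

    HashSet<int> seen = new HashSet<int>();
    Console.WriteLine(seen.Add(42));   // True:  42 was inserted
    Console.WriteLine(seen.Add(42));   // False: duplicate, the set is unchanged
    Console.WriteLine(seen.Count);     // 1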



    Once you have a clean set of unique random numbers, you can dump them into a List, then shuffle the list using an efficient shuffle (such as the Fisher–Yates shuffle mentioned in the comments).



    Combining these data structures, in the right way, leads to an overall \$O(n)\$ solution, which will scale fairly well.



    Here is some code, which works in Ideone too. Note, my C# is weak, so I have tried to make the logic clear.



    using System;
    using System.Collections.Generic;

    public class Test
    {
        static Random random = new Random();

        public static List<int> GenerateRandom(int count)
        {
            // generate count random values.
            HashSet<int> candidates = new HashSet<int>();
            while (candidates.Count < count)
            {
                // May strike a duplicate.
                candidates.Add(random.Next());
            }

            // load them into a list.
            List<int> result = new List<int>();
            result.AddRange(candidates);

            // shuffle the results:
            int i = result.Count;
            while (i > 1)
            {
                i--;
                int k = random.Next(i + 1);
                int value = result[k];
                result[k] = result[i];
                result[i] = value;
            }
            return result;
        }

        public static void Main()
        {
            List<int> vals = GenerateRandom(10);
            Console.WriteLine("Result: " + vals.Count);
            vals.ForEach(Console.WriteLine);
        }
    }


    The above code is my initial recommendation, and it will work well, and scale well for any reasonable number of values to return.



    Second Alternate Algorithm



    The problem with the above algorithm is threefold:




    1. When count is very large, the probability of collision is increased, and performance may be affected

    2. Data will need to be in both the HashSet and the List at some point, so the space usage is doubled.

    3. The shuffle at the end is needed to keep the data in a random order (HashSet does not keep the data in any specific order, and the hashing algorithm will cause the order to become biased, and skewed).



    These are only performance issues when the count is very large. Note that only the collisions at large count impact the scalability of the solution (at large count it is no longer quite \$O(n)\$, and it becomes progressively worse as count approaches Int32.MaxValue). In real life this is unlikely ever to happen; memory will become a problem before performance does.



    @JerryCoffin pointed to an alternate algorithm from Bob Floyd, where a trick is played to ensure that collisions never happen. This solves the problem of scalability at very large counts. It does not solve the need for both a HashSet and a List, and it does not solve the need for the shuffle.



    The algorithm as presented:




    initialize set S to empty
    for J := N-M+1 to N do
        T := RandInt(1, J)
        if T is not in S then
            insert T in S
        else
            insert J in S



    This assumes that RandInt(1, J) returns values inclusive of J.



    To understand the above algorithm, you need to realize that you choose a random value from a range smaller than the full range, and after each value you extend the range to include one more. In the event of a collision, you can safely insert the current maximum, because it was never possible to select it before.



    The chance of a collision increases at the same rate that the pool of unselected values decreases, so the probability of any one number appearing in the result is not skewed or biased.
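    For example, take \$N = 5\$ and \$M = 3\$, so \$J\$ runs over 3, 4, 5. At \$J = 3\$, any of 1..3 can be selected. At \$J = 4\$, any of 1..4 can be drawn, and a collision inserts 4 itself, a value no earlier step could have produced. The same holds at \$J = 5\$. Working through the cases shows that each of 1..5 ends up in the result with the same probability, \$\frac{3}{5}\$.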



    Is this almost a final answer? No



    So the above solution, in C#, would look like this (in Ideone; note that the code below now differs from the Ideone version):



    public static List<int> GenerateRandom(int count)
    {
        // generate count random values.
        HashSet<int> candidates = new HashSet<int>();
        for (Int32 top = Int32.MaxValue - count; top < Int32.MaxValue; top++)
        {
            // May strike a duplicate.
            if (!candidates.Add(random.Next(top + 1)))
            {
                candidates.Add(top);
            }
        }

        // load them into a list (ToList() requires using System.Linq).
        List<int> result = candidates.ToList();

        // shuffle the results:
        int i = result.Count;
        while (i > 1)
        {
            i--;
            int k = random.Next(i + 1);
            int value = result[k];
            result[k] = result[i];
            result[i] = value;
        }
        return result;
    }


    Note that you still need to shuffle the results, to resolve the HashSet ordering issue. Also note that the loop condition needs care, because things get messy at the edge of the integer range; top + 1 is only safe here because top never reaches Int32.MaxValue.



    Can you avoid the shuffle?



    So, this removes the need for the collision loop, but what about the shuffle? Can that be avoided by maintaining the HashSet and the List at the same time? No! Consider this function (in Ideone too):



    public static List<int> GenerateRandom(int count)
    {
        List<int> result = new List<int>(count);

        // generate count random values.
        HashSet<int> candidates = new HashSet<int>();
        for (Int32 top = Int32.MaxValue - count; top < Int32.MaxValue; top++)
        {
            // May strike a duplicate.
            int value = random.Next(top + 1);
            if (candidates.Add(value))
            {
                result.Add(value);
            }
            else
            {
                result.Add(top);
                candidates.Add(top);
            }
        }

        return result;
    }


    The above code will never allow the first value in the result to be any of the values from Max - Count to Max (there can never be a collision on the first value, and the range is not yet complete at that point), so this is a broken random generator.



    Even with this collision-avoiding algorithm, you still need to shuffle the results in order to ensure the ordering is free of bias.






    TL;DR



    Is This Your Final Answer? Yes!



    Having played with this code a lot, it is apparent that it is useful to have a range-based input as well as an Int32.MaxValue-based default. Messing with large ranges also leads to potential overflows in the 32-bit integer space.



    Working with @mjolka, the following code would be the best of both worlds:



    static Random random = new Random();

    // Note, max is exclusive here!
    public static List<int> GenerateRandom(int count, int min, int max)
    {
        // initialize set S to empty
        // for J := N-M+1 to N do
        //     T := RandInt(1, J)
        //     if T is not in S then
        //         insert T in S
        //     else
        //         insert J in S
        //
        // adapted for C#, which does not have an inclusive Next(..),
        // and to make the range configurable, not just starting at 1.

        if (max <= min || count < 0 ||
            // max - min > 0 required to avoid overflow
            (count > max - min && max - min > 0))
        {
            // need to use 64-bit to support big ranges (negative min, positive max)
            throw new ArgumentOutOfRangeException("Range " + min + " to " + max +
                    " (" + ((Int64)max - (Int64)min) + " values), or count " + count + " is illegal");
        }

        // generate count random values.
        HashSet<int> candidates = new HashSet<int>();

        // start count values before max, and end at max
        for (int top = max - count; top < max; top++)
        {
            // May strike a duplicate.
            // Need to add +1 to make the generator inclusive of top;
            // +1 is safe even for max = Int32.MaxValue because top < max.
            if (!candidates.Add(random.Next(min, top + 1)))
            {
                // collision: add the inclusive top instead,
                // which could not possibly have been added before.
                candidates.Add(top);
            }
        }

        // load them into a list, to shuffle
        List<int> result = candidates.ToList();

        // shuffle the results because the HashSet has messed
        // with the order, and the algorithm does not produce
        // random-ordered results (e.g. max - 1 will never be the first value)
        for (int i = result.Count - 1; i > 0; i--)
        {
            int k = random.Next(i + 1);
            int tmp = result[k];
            result[k] = result[i];
            result[i] = tmp;
        }
        return result;
    }

    public static List<int> GenerateRandom(int count)
    {
        return GenerateRandom(count, 0, Int32.MaxValue);
    }
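
    A quick usage sketch of the two overloads (the counts and bounds here are arbitrary illustrative values):

    public static void Main()
    {
        // five distinct values from [10, 20), in random order
        List<int> small = GenerateRandom(5, 10, 20);
        Console.WriteLine(string.Join(", ", small));

        // five distinct values from the default [0, Int32.MaxValue) range
        List<int> big = GenerateRandom(5);
        Console.WriteLine(string.Join(", ", big));
    }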

    Why do you shuffle the results? I'm confused about this because everyone seems to include this in their answers but it isn't in the OP's question

    @TopinFrassi: Because rolfl ordered the numbers (to check for duplicates) but they shouldn't be ordered.

    @TopinFrassi is right; he should just return candidates and it would be fine. Removing the shuffling part, it's only slightly better than the OP's answer, because it uses a HashSet, which makes it less expensive to find a duplicate. I said this was a problem about shuffling, but of course there are many implementations that fit the requirement.

    @BrunoCosta - The HashSet will impose a non-random (though unstructured) order on the data in the list. The OP's code would produce a random order on the result. The shuffle is required to retain the random ordered output. As for the 'slightly better' part, the HashSet will be about N/4 times faster than the List (where N is the `count`), so, for large N, say, 10,000 count, the HashSet will be 2500 or so times faster.... which, I guess, could be 'slight'.

    @rolfl Gotcha. I always forget that hash sets don't guarantee any order. I didn't mean "slightly" to depreciate the difference it may really make in a real-case scenario. But you forgot to mention in your point that you need to make one more O(n) operation, the shuffling. Obviously, though, the use of the HashSet makes up for it really quickly.

    Oh, but wait, what I said was right: basically you just said that the OP did more than he needed to, and you implemented a better version of it. Since you should do only what you need, returning the set would (also) fulfill the requirement. So @TopinFrassi's solution is better.

    Nice one, but I'd go with Jeroen's one-liner `result = result.OrderBy(x => random.Next());` instead of the last `while` loop.

    @CompuChip - the while loop I have is \$O(n)\$ whereas the OrderBy is slower at \$O(n \log{n})\$. This question has generated a lot of chat in the 2nd monitor, and you're welcome to join

    @BrunoCosta - you seem to be under the impression that I am in some sort of competition with TopinFrassi, or Jeroen. Well, there is some truth to that. Jeroen and I had a lot of good-natured discussion about this problem yesterday in the 2nd Monitor (transcript) and I invite you (or anyone else) to join us to discuss this more.

    I should also point out, BrunoCosta, that I did not forget to mention the \$O(n)\$ shuffle operation, and you should know that two consecutive \$O(n)\$ operations are still \$O(n)\$. @TopinFrassi suggested a similar solution, but it does not produce results like the OP's, and I was working on this answer before he posted his. I agree that if you remove the shuffle, our answers are mostly the same, but the shuffle is not something that can be removed while keeping the results similar to the OP's. Additionally, I provide different reasoning (review) as to why to use the HashSet.

    I don't see why there's a debate about this answer; it is a very good one, and no one should mind that our answers look alike. At least I don't mind. His explanations are different (should I say better?) than mine.

    @rolfl As the number of randoms requested begins to approach the maximum range the randoms are allowed to be in, your loop which generates the numbers potentially takes longer and longer (if 1,000,000 unique numbers are requested in a range of 1-1,000,000, there's a 1-in-1,000,000 chance of generating the last number; on average this takes a million iterations just for the last number). As such, when the number requested is greater than half the total range, it might be best to start with a full set and randomly remove until the set is at the right size.
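    A sketch of that inversion (an illustration, not code from the thread; it materializes the full range, so it only makes sense when the range fits comfortably in memory, and it uses a partial Fisher–Yates shuffle rather than literal random removal, which has the same effect):

    using System;
    using System.Collections.Generic;

    public class DenseSelection
    {
        static Random random = new Random();

        // For count > (max - min) / 2: materialize the full range once,
        // then do a partial Fisher-Yates shuffle of the first count slots.
        public static List<int> GenerateDense(int count, int min, int max)
        {
            List<int> pool = new List<int>(max - min);
            for (int v = min; v < max; v++)
            {
                pool.Add(v);
            }

            // After i swaps, pool[0..i) is a uniformly random sample of
            // size i, in uniformly random order; no retry loop is needed.
            for (int i = 0; i < count; i++)
            {
                int k = i + random.Next(pool.Count - i); // pick from pool[i..)
                int tmp = pool[i];
                pool[i] = pool[k];
                pool[k] = tmp;
            }

            return pool.GetRange(0, count);
        }
    }

    This trades memory (holding the whole range at once) for a guaranteed single pass with no collision retries.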

    Updated my answer in response to the bounty request.

    @rolfl Jerry Coffin's solution **almost** removes the need for the shuffle if the number of elements is much smaller than the range length. There is still an issue, though: the last element of the range can't ever appear in the first position. So instead of a complete shuffle you can do a partial shuffle.

    @BrunoCosta - that very much depends on the specific implementation used by the hashing function in HashSet, which is not part of the specification. You cannot trust the HashSet for any order-preservation, and anything other than perfect preservation, or perfectly random, is not good enough.

    @rolfl Also, as I stated in my answer, that algorithm won't work if count = Int32.MaxValue. And you are making a wrong comparison in your for loop.

    @rolfl Refrain from using a HashSet; see my solution to see what I mean.

    I like this answer, a lot of effort was put into it, +1. I was worried not to see the final version filling the list along the way. But I still have one objection: the *randomness* of the final solution is not perfect; it would never allow the highest value at the beginning (the shuffle helped avoid that). I would personally use the original, patched with a HashSet, for the duplicate test. No more changes (except code styling). Of course, I assume it is not desired to eat memory with a list of 2G ints -> 8GB.

    @firda - that's a real problem.... I never noticed that. The shuffle is actually required because ***none*** of the `Max - Count` values will ever be at position 0. The shuffle is required. My 'is that your final answer', is not good enough.

    I realize that anyone reading my last comment will think the solution is "not good enough". The answer was updated (again) after that comment, and the code is now, as far as I know, complete. That's my final answer ;-)

    Why is a `HashSet` better than a `List`? The hash code of an `int` is an `int`, and probably itself, so surely when you do `HashSet.Add(int)` it's just internally doing `_hashes.Contains(int)` and adding a few unnecessary `GetHashCode()` overheads into the bargain?

Licensed under CC BY-SA with attribution.