Evil interview questions: Unique & Random C
Writing in C (and using only the C std lib as building blocks, which explicitly excludes C++ and all its stuff), generate 1 million unique random numbers.
For reference, here is the code in C#:
var random = new Random();
var set = new HashSet<int>();

var sp = Stopwatch.StartNew();

while (set.Count < 1000 * 1000)
{
    set.Add(random.Next(0, int.MaxValue));
}

Console.WriteLine(sp.ElapsedMilliseconds);
It is a brute force approach, I’ll admit, but it completes in about 150ms on my end. Solution must run in under 10 seconds.
This question just looks stupid; it can actually tell you quite a bit about the developer.
Comments
Considering http://xkcd.com/221/, returning numbers from 0 to 1M - 1 should be fine :) If the numbers do not have to be integers, I'd sum rand([0,1[) to every number of the previous random set just to make them different at every run :D Am I a cheater?
Assuming nothing about the distribution of these numbers, and without having to print them all, this runs in a fraction of a second.
Heh, code formatting in comments sucks
You do indeed have a unique set. But you don't actually test that the set contains 1,000,000 items. If you have dupes from the random number generator, then what?
@Ramon - yes it does. That would be the while on the add...
1) We need to call rand() twice for each number because it is only guaranteed to return values in [0, 32767]
2) We have no built-in hash table, so we'll use a bool[1000000] to keep track of the duplicates. I'd now point out the various reasons this is a hack and problematic and ask the interviewer whether he wants me to fix that or whether it is enough.
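A minimal C sketch of that two-point plan (the names `rand_below` and `fill_unique` are my additions; note that, per the bool array, this assumes the values are confined to [0, 1000000), which a later comment in the thread pushes back on):

```c
#include <stdlib.h>
#include <string.h>

#define COUNT 1000000

/* rand() only guarantees 15 bits (RAND_MAX >= 32767), so combine two draws */
static int rand_below(int n)
{
    int r = ((rand() & 0x7fff) << 15) | (rand() & 0x7fff);
    return r % n; /* slight modulo bias; acceptable for a sketch */
}

/* fills out[] with the values 0..COUNT-1 in random arrival order,
   using a plain bool array instead of a hash table to reject duplicates */
void fill_unique(int *out)
{
    static unsigned char seen[COUNT];
    int filled = 0;

    memset(seen, 0, sizeof seen);
    while (filled < COUNT) {
        int v = rand_below(COUNT);
        if (!seen[v]) {
            seen[v] = 1;
            out[filled++] = v;
        }
    }
}
```

Because every value below 1,000,000 must eventually be drawn, this is the coupon-collector problem: roughly n ln n, about 14 million draws, still comfortably inside the 10-second budget.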
Easy.
var sw = new Stopwatch();
var store = new SortedDictionary<Guid, int>();
sw.Start();
for (int i = 0; i < 1000000; i++)
{
    store.Add(Guid.NewGuid(), i);
}
sw.Stop();
Console.WriteLine("Elapsed ms: {0}", sw.ElapsedMilliseconds);

// show the results ;)
int amount = 0;
foreach (var kvp in store)
{
    Console.WriteLine("Value: {0}", kvp.Value);
    amount++;
    if (amount > 100)
    {
        break;
    }
}
The idea is to sort on the guid. It's not that efficient, but the question is rather stupid, and it does guarantee 1000000 numbers in (pseudo) random order.
(oops, that formatting went seriously wrong).
Marco, How would you get a proper random distribution there? And how do you ensure uniqueness?
Rafal, Can you do a gist?
Ramon, Because it is a unique set, duplicates generated will be ignored.
Tobi, Assume that you need full 32 bits for the numbers, so you can't just have an array for all the items. What do you do then?
Frans, There is a reason I said C, not C#. And your code would produce only the numbers 1 .. 1,000,000; I want the full 32-bit range.
Integers from 1 to 1M - 1 are unique and normally distributed, while if I sum a random floating number between 0 and 1 (not included) to each integer from 0 to 1M - 1, I get a set of unique random floating numbers (sort of normally distributed).
It's a cheat I know but it solves the given problem respecting the constraints.
sorry I meant uniformly distributed not normally
The idea is to divide all integer numbers into N chunks. Then we choose a random number from each chunk. Here is the code: https://gist.github.com/SHSE/6791622
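The gist is the authoritative version, but the core of the chunk idea fits in a few lines of C (`fill_chunked` is my name for it; uniqueness holds by construction, since each value stays inside its own chunk):

```c
#include <stdlib.h>

#define COUNT 1000000
#define CHUNK (0x80000000u / COUNT) /* 2147 candidate values per chunk */

/* one random value from each of COUNT equal chunks of [0, 2^31) */
void fill_chunked(unsigned int *out)
{
    for (unsigned int i = 0; i < COUNT; i++)
        out[i] = i * CHUNK + (unsigned int)rand() % CHUNK;
}
```

The output comes back ordered by chunk, so a shuffle is still needed, and (as discussed further down the thread) drawing exactly one value per chunk is not the same as a uniform draw over the whole range.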
http://pastebin.com/VdScTnaa Same idea as Marco's from the first comment
@James: Ah yes! I missed that :) thanks
In that case I'd build myself a crude hash table that is a LinkedList<int>[1000000]. Only adding is needed and I can rely on malloc to build the linked list. That's the simplest hash table possible and can be built as a quick hack.
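A sketch of that crude add-only table in C (`set_add` and the bucket count are my assumptions; only insertion is supported, which is all this problem needs):

```c
#include <stdlib.h>

#define BUCKETS 1000000

struct node { int value; struct node *next; };
static struct node *table[BUCKETS];

/* returns 1 if value was newly inserted, 0 if it was already present */
int set_add(int value)
{
    unsigned int b = (unsigned int)value % BUCKETS;
    for (struct node *p = table[b]; p; p = p->next)
        if (p->value == value)
            return 0; /* duplicate */
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = table[b];
    table[b] = n;
    return 1;
}
```

With 1,000,000 buckets and ~1,000,000 entries, chains average one node, so this behaves like the C# HashSet for this workload.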
Alternatively, I'd generate 10m random numbers, sort them with qsort and implement a simple distinct algorithm which is easy for a sorted array. I'd wrap this in a retry-loop in case I do not get 1m unique numbers out of the operation.
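The sort-and-dedupe step of the alternative might look like this in C (`sort_dedupe` is a hypothetical helper; the oversampling and the retry loop are left to the caller):

```c
#include <stdlib.h>

/* comparator for qsort over ints, avoiding subtraction overflow */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* sorts buf[0..n-1] in place and removes duplicates; returns unique count */
size_t sort_dedupe(int *buf, size_t n)
{
    if (n == 0)
        return 0;
    qsort(buf, n, sizeof *buf, cmp_int);
    size_t w = 1;
    for (size_t i = 1; i < n; i++)
        if (buf[i] != buf[w - 1])
            buf[w++] = buf[i];
    return w;
}
```

After deduping, the survivors are sorted, so a shuffle pass would still be needed before handing back the first 1m.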
Ayende, you have to stop doing that. I really can't resist this kind of algorithmic exercises and this hurts my productivity. Here's another shot at it, this time taking into consideration the requirement that the output must be uniformly distributed in full 32 bit range. I made use of the fact that a cartesian product of two uniformly distributed sets will also be uniformly distributed. I mean, 1000000 is 1000x1000 so I can produce two sets of 1000 unique numbers and combine them.
http://pastebin.com/8NXDUSG5
Sorry if it crashes, didn't really wait until it spits out all these numbers ;)
sorry, the previous version contained an error (two random numbers were incorrectly combined into one in the 'printf' line)
http://pastebin.com/LCedvFNucklwar fecuring
AAARGH!
http://pastebin.com/LCedvFNu
Unless external input is given, all solutions will be pseudo-random. Truly random would need to acquire data from a random physical source. For 1m numbers you would need quite a lot of entropy. Maybe hook up my webcam to observe a lava lamp and take the least significant bits from some pseudo-random pixels. Now the C part....
I know it's beside the point, but you could speed up the C# segment:
while (set.Count < 1000 * 1000)
{
    set.Add(random.Next(0, int.MaxValue));
}
by leveraging the result of HashSet.Add
int todo = 1000 * 1000;
while (todo != 0)
    if (set.Add(random.Next(0, int.MaxValue)))
        --todo;
or if you want the C# to be terser but [arguably] less legible:
for (int todo = 1000 * 1000; todo > 0; )
    todo -= set.Add(random.Next(0, int.MaxValue)) ? 1 : 0;
errrr... a final version fixing few bugs (incorrect parentheses + uninitialized array that created some undesired entropy in run time)
http://pastebin.com/e9Sf3eJA
I don't know C, but I guess this would be easy to translate
public static int[] getRandoms(int count, int max)
{
    var random = new Random();
    int[] results = new int[count];
    var mask = new BitArray(max + 1);
    int counter = 0;
    while (counter < count)
    {
        int r = random.Next(max + 1);
        if (!mask[r])
        {
            results[counter] = r;
            mask[r] = true;
            counter++;
        }
    }
    return results;
}
@Sergey,
That's not random! With your solution, you get precisely one number from each chunk, so for instance the set[1, 2, 3, whatever..] could not occur, because 1, 2 and 3 are all in the same chunk.
Here is one approach.
Start with a linked list containing all numbers in the range you care about (e.g. -2^31 through (2^31) - 1). Next loop 1000*1000 times. In each loop get a random number between 0 and (2^32) - i, where i = incrementing counter for the loop. Get and remove the number at the position found by the random number, push that number to end of an array for your output.
Yes, you do have to do 1,000,000 "seeks". But it should be fairly constant time. Perhaps there is a better/faster way to randomize the initial set of numbers (random swapping?).
Here's a super short version. It kind of cheats by ensuring the random number is always greater than the last one. But hey, the numbers are still random, and the code is dead simple. :-)
https://gist.github.com/JudahGabriel/6794792
(It's been a long time since I wrote C!)
Judah, But this will generate a random set of numbers in _ascending order_. That is what we don't want.
I have a sneaking suspicion that one answer to this is to use a 32-bit LFSR and just generate 1000000 numbers; since it's shift-register based, there can't be any duplicates (assuming you don't start from 0).
Need to think some more about what to do if you're using rand_s instead.
@Geert, good point! It can be fixed by introducing randomness in the chunk allocation logic. Here is the code: https://gist.github.com/SHSE/6795548
An LFSR representing a primitive polynomial will generate numbers from the whole range.
So a primitive order 32 polynomial will have period of 2^32-1 and go through all 32 bit numbers (excluding 0)
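A Galois-form sketch of that idea; the tap mask 0x80200003 (taps 32, 22, 2, 1) is one commonly listed maximal-length choice, so any window of up to 2^32 - 1 consecutive outputs is duplicate-free:

```c
#include <stdint.h>

/* one step of a 32-bit Galois LFSR; state must never be 0 */
uint32_t lfsr_next(uint32_t state)
{
    /* -(state & 1u) is all-ones when the low bit is set, 0 otherwise,
       so the taps are XORed in only when a 1 falls off the end */
    return (state >> 1) ^ (-(state & 1u) & 0x80200003u);
}
```

The outputs are unique and extremely cheap, but also completely predictable, which is exactly the "how random do you want your random" trade-off.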
https://gist.github.com/jesuslpm/6801439
I'd probably go with the following :-
http://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
Can't get much faster than that. The only question you have to ask is how random do you want your random to be :)
https://gist.github.com/xytis/6805873
A similar answer to Judah Gabriel Himango's, but with additional features: 1. Ensures that the random numbers are uniformly distributed over the range [0..MAX_INT] 2. Shuffles the obtained set.
With one drawback: If implementation of rand() has RAND_MAX < 1000000, then set shuffle does not work correctly. (May be fixed by using offsets from current location, instead of absolute indices...)
Also, it is actually slower than Keith's answer, but in memory consumption it performs better than C# example.
I haven't done much C (do C# for a living) but I took a stab at it. There are four files denoted by /** NAME.EXT **/ comments.
Didn't write it to fail if it took longer than 10 seconds since the example C# didn't. Also assumed by C std lib you meant C standard libraries, and not just stdlib.h.
http://pastebin.com/AkWDNuXr
I have another implementation, it takes 3 milliseconds to complete.
https://gist.github.com/jesuslpm/6808179
http://pastebin.com/iVDd9cYW
Basic algorithm: generate more values than needed and dedupe them. If there are too few results, fill in the empty part of the buffer and repeat. Shuffle and take the first N values. This wastes 0.8MB of RAM in exchange for being easily understood and easily implemented.
I could have used less extra space, or even no extra space, but that would devolve into O(#rounds * n log n). As is, it'll be rare to have less than the desired number of values in the initial run.
I'd use a segtree with lazy propagation (or a rbtree) to be able to answer the nth unused number in O(log^2 n).
But I'm not messing around with mallocs today
What do you mean by random here? That each set of 1m unique ints has an equal probability?
I'd cheat.
n_(x+1) = n_x + rnd(1, int32.max/10^6)
for x = 0 to 10^6, n_0 = 0
Actually, n_0 should be -1 if I'm generating random numbers between 1 and int32.max/10^6. And I'd work for x=0 to (10^6 + 1) and drop the first result.
Should probably round the int32.max/10^6 up, and the last result should be calculated from the range between the previous number and int32.max. Otherwise the possible numbers would be around 0 to int32.max - 10^6.
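A C sketch of this cheat (`fill_increasing` is my name; INT32_MAX / 1000000 = 2147, and starting from -1 makes 0 a possible first value, per the correction above):

```c
#include <stdint.h>
#include <stdlib.h>

#define COUNT 1000000
#define STEP  (INT32_MAX / COUNT) /* 2147 */

/* fills out[] with COUNT strictly increasing pseudo-random numbers:
   each is the previous plus a random step in [1, STEP] */
void fill_increasing(int32_t *out)
{
    int32_t n = -1;
    for (int i = 0; i < COUNT; i++) {
        n += 1 + rand() % STEP;
        out[i] = n;
    }
}
```

1,000,000 × 2147 = 2,147,000,000 stays just under INT32_MAX, so there is no overflow; the price of the cheat is that the output comes back fully sorted.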
A simple approach is to maintain a binary search tree of values generated so far. Generate a random integer and check the tree. If the tree already contains the value, retry. Otherwise, add the number into the result array and insert it into the tree.
Since the values are random, balancing the tree wouldn't be necessary in practice.
Also, the retries are expected to be rare, so their impact should be negligible. Even when generating the last (1,000,000th) number, the probability of collision with a number already selected is < 0.05%.
As an implementation note, it is unnecessary to allocate memory for each tree node - i.e., we don't need 1,000,000 malloc calls. We can just pre-allocate an array of 1,000,000 tuples [int, int, int], whose meaning is [value, left_child_index, right_child_index].
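A sketch of that pre-allocated tree (`tree_insert` and the field names are mine; a child index of -1 stands for "no child"):

```c
#define COUNT 1000000

struct tnode { int value; int left, right; };
static struct tnode nodes[COUNT]; /* pre-allocated pool, no malloc per node */
static int used = 0;              /* slots taken; 0 means the tree is empty */

/* returns 1 if value was inserted, 0 if it was already in the tree */
int tree_insert(int value)
{
    if (used == 0) {
        nodes[0] = (struct tnode){ value, -1, -1 };
        used = 1;
        return 1;
    }
    int i = 0;
    for (;;) {
        if (value == nodes[i].value)
            return 0; /* duplicate: caller retries with a new random number */
        int *child = value < nodes[i].value ? &nodes[i].left : &nodes[i].right;
        if (*child == -1) {
            nodes[used] = (struct tnode){ value, -1, -1 };
            *child = used++;
            return 1;
        }
        i = *child;
    }
}
```

Since the inserted values arrive in random order, the tree stays close to balanced without any rotation logic, as the comment argues.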
Generate sequence 1..1000000, then shuffle the results. https://gist.github.com/notfed/6878293
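A sketch of fill-then-shuffle using a Fisher–Yates pass (`fill_shuffled` is my name; the gist is the authoritative version, and two rand() calls build the swap index because RAND_MAX may be as small as 32767):

```c
#include <stdlib.h>

#define COUNT 1000000

/* fills out[] with 1..COUNT, then shuffles in place (Fisher-Yates) */
void fill_shuffled(int *out)
{
    for (int i = 0; i < COUNT; i++)
        out[i] = i + 1;
    for (int i = COUNT - 1; i > 0; i--) {
        int j = (int)((((unsigned)rand() << 15) ^ (unsigned)rand())
                      % (unsigned)(i + 1));
        int tmp = out[i];
        out[i] = out[j];
        out[j] = tmp;
    }
}
```

Uniqueness is free here; the catch raised earlier in the thread is that the values only span 1..1,000,000 rather than the full 32-bit range.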