Question 6 is a trap, a very useful one
In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.
Stop for a moment and ponder that. What do you assume that relative complexity of those questions?
Questions 1 –5 should take anything between 10 – 15 minutes to an hour & a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.
Question 6 require that you’ll create an index for a 15 TB CSV file, and allow efficient searching on it.
While questions 1 – 5 are basically gate keeper questions. If you answer them correctly, we’ve a high view of you and you get an interview, answering question 6 correctly pretty much say that we past the “do we want you?” and into the “how do we get you?”.
But people don’t answer question 6 correctly. In fact, by this time, if you answer question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.
Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.
Candidate 1’s code looked something like this:
foreach(var line in File.EnumerateAllLines("big_data.csv")) { var fields = line.Split(','); var email = line[2] File.WriteAllText(Md5(email), line); }
Plus side, this doesn’t load the entire data set to memory, and you can sort of do quick searches. Of course, this does generate 400 billion files, and takes more than 100% as much space as the original file. Also, on NTFS, you have a max of 4 billion files per volume, and other FS has similar limitations.
So that isn’t going to work, but at least he had some idea about what is going on.
Candidate 2’s code, however, was:
// prepare string[] allData = File.ReadAllLines("big_data.csv"); var users = new List<User>(); foreach(var line in allData) { users.Add(User.Parse(line)); } new XmlSerializer().Serialize(users, "big_data.xml"); // search by: var users = new XmlSerialize().Deserialize("big_data.xml") as List<User>() users.AsParallel().Where(x=>x.Email == "the@email.wtf");
So take the 15 TB file, load it all to memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it back as xml (fail #3,#4,#5). Read the entire (greater than) 15 TB XML file to memory (fail #6), try to do a parallel brute force search on a dataset of 400 billion records (fail #7 – #400,000,000,000).
So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.
 

Comments
Mocking these people publicly is pretty harsh, but I have no defence for candidate #2, that's hilarious!
Where do you get these guys? do they not know who you are and the kind of questions you'll ask?
One of the first questions on the interview should be "what do you know about us?" and surely these candidates have no idea about your company.
chunking through MemoryMappedFile
Pretty sure you could do this with just tee, sed, and sort (or parallel). It would be my first attempt before even attempting to write any C#.
I don't have 15TB of disk space free to test :)
Jason, No, you can't do it like that. Sort require access to the entire data set to actually be able to sort.
Sort (BASH) can sort very large files. parallel (same but parallelised) can do the same thing faster.
They both use very little memory as they create temp files to /tmp.
Whether they can sort a 15TB file I don't know.... but certainly 100s of GBs
Jason, I didn't know about that, very interesting, thanks.
File.WriteAllText - If the target file already exists, it is overwritten.
So, Code 1 will loose data if some emails repeat in the file.
The problem appears to be a tradeoff between disk space, memory, and speed. Do you place limitations on any of those, or have a preference for which resource is minimized?
Ben, In the question, I specify what machine this is running on, then letting the candidate try it out.
how about this, sort the file, using disk based merge sort. memory map the file and do a binary search for the require key.
Isaiah, Great, that is the expected solution. You can even skip the mmap, you don't need that.
Do you mean "var email = fields[2]" (candidate #1 line 4)?
Mandawah, yes, sorry about that.
Thanks. I am so anxious about missing the whole point ;-)
When I hire developers I find that the old trusty Fizz Buzz test eliminates about 90% of our candidates. See http://blog.codinghorror.com/why-cant-programmers-program/.
Mike, FizzBuzz is great, but that just gets you into the "can or cannot" program. I'm actually interested in "can program REALLY well"
For me instead of questions there are only empty paragraphs, is this interview test taken from movie "Exam"? :)
The interesting thing to note is that you are on your own for these exercise. Full access to the internet.
Google. It works.
Do the candidates get to the see the questions before they make they're decision?
Ade, They have all the questions, yes.
Empty lines here too. :(
But I'd fail any candidate who'd try writing a line in C#.
Push the thing into a serious DBMS and let it do the job? Hmm... I suspect that wasn't the expected answer...
Comment preview