Stupid, stupid, STUPID!

time to read 3 min | 532 words

Hi, my name is Oren, and I am stupid. I forgot that select() ain't broken holds even if I wrote the select().

Let me tell you the story. We are using Rhino ETL to do ETL stuff, and one of the tasks that we had to do was importing an 80 MB file into the CRM, including all sorts of transformations and joins along the way. Until we got to that part, it was working very well, but with the 80 MB file, it read that and then appeared to be hung.

Rhino ETL is a new project, and it is heavily multi threaded. Since the only way that I know to test multi threading is by test of time, I wasn't overly confident that it got it perfectly right, and it looked like the 80 MB file was triggerring some threading issues. After debugging the issue futilely for a while, and after careless applying of lock() with reckless abandon also failed to do the trick, I decided that I had no alternatives, but to rip out the homegrown threading solution that I had there in favor of something with more robustness.

I choose Retlang for that, and I am really happy with the library, but that isn't the story that I have to tell today. As any who dealt with threading knows, they are complex, nasty and hard. Trying to follow ten threads at the same time in not helpful by any meaning of the word. And trying to replace the threading solution with a radically different model is hard. I have a post or two on big refactoring as a result of that, but again, this isn't the story that I have to tell.

After basing Rhino ETL on Retlang, and getting enough of the tests to pass in a satisfactory manner, I turned back to the real issue at hand, and tried to run the 80 MB file through the new Rhino ETL.

It. Showed. The. Exact. Same. Problem.

At that point, I was on the phone, speaking with the hiring managers about getting a thousand monkeys to type the information in, but I kept tapping the keyboard, probably out of a force of a habit. I couldn't believe that two such radically different implementations would exhibit the same problem. I added logging and tracked down everything. I even added color coded logs, just to be sure that I am seeing everything.

I tracked it down to a log statement that basically said:

Join CustomerWithFoodTypeLookup out 0 rows, Left 103,123 rows, Right 0 rows.

Take a look at this, and try to tell me what the problem is. 

We are joining an empty set to a non empty set, giving... an empty set.

The fix, if you can call it that, was to change the join to:

if Left.FooTypeId = Left.OldFoodTypeId or Right.FooTypeId is null

I rewrote the threading layer because I couldn't track down the root cause of a inner join instead of an left outer join. When I found that, I literally got up and started choking the developer responsible for the script. Then I went and hit myself on the head, repeatedly.