Stupid, stupid, STUPID!
Hi, my name is Oren, and I am stupid. I forgot that select() ain't broken holds even if I wrote the select().
Let me tell you the story. We are using Rhino ETL to do ETL stuff, and one of the tasks that we had to do was importing an 80 MB file into the CRM, including all sorts of transformations and joins along the way. Until we got to that part, it was working very well, but with the 80 MB file, it read that and then appeared to be hung.
Rhino ETL is a new project, and it is heavily multi threaded. Since the only way that I know to test multi threading is by test of time, I wasn't overly confident that it got it perfectly right, and it looked like the 80 MB file was triggerring some threading issues. After debugging the issue futilely for a while, and after careless applying of lock() with reckless abandon also failed to do the trick, I decided that I had no alternatives, but to rip out the homegrown threading solution that I had there in favor of something with more robustness.
I choose Retlang for that, and I am really happy with the library, but that isn't the story that I have to tell today. As any who dealt with threading knows, they are complex, nasty and hard. Trying to follow ten threads at the same time in not helpful by any meaning of the word. And trying to replace the threading solution with a radically different model is hard. I have a post or two on big refactoring as a result of that, but again, this isn't the story that I have to tell.
After basing Rhino ETL on Retlang, and getting enough of the tests to pass in a satisfactory manner, I turned back to the real issue at hand, and tried to run the 80 MB file through the new Rhino ETL.
It. Showed. The. Exact. Same. Problem.
At that point, I was on the phone, speaking with the hiring managers about getting a thousand monkeys to type the information in, but I kept tapping the keyboard, probably out of a force of a habit. I couldn't believe that two such radically different implementations would exhibit the same problem. I added logging and tracked down everything. I even added color coded logs, just to be sure that I am seeing everything.
I tracked it down to a log statement that basically said:
Join CustomerWithFoodTypeLookup out 0 rows, Left 103,123 rows, Right 0 rows.
Take a look at this, and try to tell me what the problem is.
We are joining an empty set to a non empty set, giving... an empty set.
The fix, if you can call it that, was to change the join to:
if Left.FooTypeId = Left.OldFoodTypeId or Right.FooTypeId is null
I rewrote the threading layer because I couldn't track down the root cause of a inner join instead of an left outer join. When I found that, I literally got up and started choking the developer responsible for the script. Then I went and hit myself on the head, repeatedly.
 

Comments
Maybe this is a stupid question, but...
Why on earth would you be writing your own ETL tool? Why not just use SQL Integration Services?
Oren don't hit your head too much. The community wouldn't be happy if you do something nasty to your bright head. :)
Steve,
Because SSIS is a waste of time and a fountain of frustration
@The Other Steve, read earlier posts.. Oren talks about problems/frustrations with SIS. Also Oren would rather write it himself, even when it's a trivial binary converter which is freely available on the web ;-)
Hey, look at the bright side - you can rollback all of your work and return to the pre-re-written implementation in a hearbeat!
"When all else fails, take comfort in your source control".
Off topic: do you know any Israelis (other than yourself) who are going to attend DevTeach? Is there a group or something? I might be going and I don't want to be alone :)
Ok, ok, granted. And sadly it's not just SSIS. We use Informatica here, and I'm amazed that the guys that run this thing can't duplicate a job against a different database without spending like 8 hours redoing everything.
Actually,
That work is something I probably would have done anyway.
I think that it is a very good one.
But the reasoning for that is maddening.
I don't know of anyone who is attending from Israel aside from me.
Humm... if the process is hanging because a query returned no rows, doesn't it mean there's still a bug in there somewhere in Rhino ETL? :) (probably nothing to do with threading, though!)
I have nothing significant to add but my eight-year-old daughter, who was hovering as I went through my feeds, insisted that I tell you it isn't polite to say, "stupid, stupid, stupid".
Call me slow, but that's why I keep things simple.
1) Screw any libraries abstract straight on SQL. Just have your C# call SQL and be done with it.
2) Add logging to all your programs from the beginning, not as a retrofit.
3) Simple code = simple debugging = your program spends more time doing what it's suppossed to do without you having to slave over it
The observed behavior was a hang, in practice, it wasn't hang, but closed the current process, we just failed to realize that.
The observed behavior was a hang, in practice, it wasn't hang, but closed the current process, we just failed to realize that.
Mr_Less_Abstraction,
When I need to load and transform data from 10 sources, it is not simple, trust me on that.
The good news is that you proved that the original threading solution works.
But perhaps you should log a warning if the result of any join is empty. This was the first thing I did in my FetchDataFromAdAndWhatNotThenCreateReports-Application: If there is an empty result, no matter where, yield a warning.
To just look on the positive side...
Congrats on a slightly better Rhino ETL :)
I've been doing this since before you were in diapers and a lot of your stuff is harder than it needs to be.
Trust me on that.
Mr_Less_Abstraction,
I am not going to trust an anonymous guy off the internet. Show your credentials, and then we may talk.
Please note that any additional insulting comments will be removed. I am willing to have a discussion, but it will be a respectable one.
That is interesting, and while I don't doubt that this is true, I would love to get a patch from you to any of my stuff.
Put your code where your mouth its, basically.
Just this Friday I was trying rhino-etl with some files with 400,000 records. If I tried it against a Sql Server 2000 database, the process would hang after inserting about 1% of the records. If I tried it against a Sql Server 2005 instance, I would get the following exception:
System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
This exception occurred when executing the IDBCommand on the SendToDabase method for the DataDestination.
I suspected a problem between your threading architecture and the Sql Connection pool, but I didn't have an idea on how to fix it.
I just tested the new version, and it works fine on both scenarios.
I'm sorry if you found my comments insulting as they were not meant to be.
Most people in this field code for 3-5 years then go (or are voluntered) to the management track and forget what dirt under the fingernails is like.
All I ask is that you keep my comment - "keep everything as simple as possible" - in mind if you are still a coder in 20 years.
Comment preview