Thread safety is a bitch.
- Fully working on SVN now, including all the tests.
- A lot of work done on the engine side, mostly minor fixes, thread safety, refactoring the way rows are passed between stages in the pipeline, etc.
- "Local variables" for transforms and joins - Local per pipeline, so you can keep state between runs
- Joins - Right now it is nested loops / inner join only, since that seems to be the most common scenario that I have. It does mean that I need to queue all the data for the join before it can be passed onward in the pipeline. Here is how you define it:
    join JoinWithTypeCasting_AndTransformation:
        if Left.Id.ToString() == Right.UserId:
            Row.Id = Left.Id
            Row.Email = Left.Email
            Row.FirstName = Left.Name.Split(char(' '))[0]
            Row.LastName = Left.Name.Split(char(' '))[1]
            Row.Organization = Right["Organization Id"]
It should be mentioned that the join definition is not actually a proper method. I deconstruct the if statement into a condition and a transformation, which should make it easier to implement more efficient join algorithms in the future, since I can execute the condition without the transformation.
- Support for distinct, which turned out to be fairly easy to handle. It can do a full-row distinct or a distinct based on several columns.
    transform Distinct:
        Context.Items["Rows"] = {} if Context.Items["Rows"] is null
        key = Row.CreateKey(Parameters.Columns)
        if Context.Items["Rows"].ContainsKey(key):
            RemoveRow()
            return
        Context.Items["Rows"].Add(key, Row)
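The join and distinct behaviour described above can be pictured with a small standalone sketch. This is Python rather than the Rhino.ETL DSL, and it is not the engine's actual code: `nested_loops_join`, `distinct`, `matches` and `build` are hypothetical names chosen for illustration, and rows are assumed to be plain dictionaries with hashable values. It only tries to show the two ideas from the text: the join condition is kept separate from the transformation, and distinct can key on the full row or on selected columns.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Sequence, Set, Tuple

Row = Dict[str, Any]


def nested_loops_join(
    left_rows: Iterable[Row],
    right_rows: Iterable[Row],
    condition: Callable[[Row, Row], bool],
    transform: Callable[[Row, Row], Row],
) -> Iterator[Row]:
    """Inner join: emit transform(left, right) for every matching pair."""
    # Both inputs are buffered in full before any output is produced,
    # which mirrors the queueing the post mentions.
    left_buffer = list(left_rows)
    right_buffer = list(right_rows)
    for left in left_buffer:
        for right in right_buffer:
            # The condition is evaluated on its own, so another join
            # algorithm could reuse it without running the transformation.
            if condition(left, right):
                yield transform(left, right)


def distinct(rows: Iterable[Row], columns: Sequence[str] = ()) -> Iterator[Row]:
    """Drop rows whose key was already seen; no columns means the full row."""
    seen: Set[Tuple[Tuple[str, Any], ...]] = set()
    for row in rows:
        names = columns or sorted(row)
        key = tuple((name, row.get(name)) for name in names)
        if key in seen:
            continue  # duplicate key: skip the row
        seen.add(key)
        yield row


# Hypothetical sample data shaped like the join definition in the post.
users = [{"Id": 1, "Email": "ada@example.com", "Name": "Ada Lovelace"}]
orgs = [
    {"UserId": "1", "Organization Id": 42},
    {"UserId": "2", "Organization Id": 7},
]


def matches(left: Row, right: Row) -> bool:
    return str(left["Id"]) == right["UserId"]


def build(left: Row, right: Row) -> Row:
    parts = left["Name"].split(" ")
    return {
        "Id": left["Id"],
        "Email": left["Email"],
        "FirstName": parts[0],
        "LastName": parts[1],
        "Organization": right["Organization Id"],
    }


joined = list(nested_loops_join(users, orgs, matches, build))
print(joined)
```

Because `condition` and `transform` are independent callables, swapping the double loop for something like a hash join would only change `nested_loops_join`, not the definitions passed to it.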
What remains to be done?
Well, Rhino.ETL is very promising, but it needs several more engine features before I would say it is possible to go live with it:
- Aggregators - Right now there is no way to handle something like COUNT(*). Should be fairly easy to build.
- Parallel / Sequence / Dependencies between pipelines / actions - I need a way to specify that this set of pipelines / actions should happen in sequence or in parallel, and that some should start after others have completed. This has a direct effect on how transactions would work.
- Transactions - No idea how to support this. The problem is that it basically means that I need to move all the actions that are happening inside a pipeline into a single thread. It also opens some interesting issues regarding database connection life cycles.
- Non database destination / source - I am thinking that I need at a minimum File, WebService and Custom (code). I need to evaluate using File Helpers as the provider for all the file processing.
- Error handling - abort the current processing on error
- Packaging - Command line tool to run a set of scripts
- More logging
- Standard library - things like count, sum, distinct, etc. Just a set of standard transforms that can be easily used.
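As a rough illustration of what the missing COUNT(*)-style aggregator would have to compute, here is a minimal Python sketch. It is purely hypothetical and says nothing about how Rhino.ETL will implement aggregation: `count_by` and the `Count` output column are made-up names, and rows are again assumed to be dictionaries with hashable values.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, Any]


def count_by(rows: Iterable[Row], columns: Sequence[str] = ()) -> List[Row]:
    """Count rows per distinct combination of the given columns.

    With no columns, all rows fall into a single group, which is the
    closest analogue of a plain COUNT(*).
    """
    counts: Dict[Tuple[Any, ...], int] = {}
    for row in rows:
        key = tuple(row.get(name) for name in columns)
        counts[key] = counts.get(key, 0) + 1
    # One output row per group: the grouping columns plus the count.
    return [dict(zip(columns, key), Count=n) for key, n in counts.items()]


orders = [
    {"Customer": "acme", "Total": 10},
    {"Customer": "acme", "Total": 25},
    {"Customer": "globex", "Total": 5},
]
per_customer = count_by(orders, ["Customer"])
overall = count_by(orders)
print(per_customer, overall)
```

Like the join, an aggregator of this kind has to see every input row before it can emit anything, so it would need the same buffering between pipeline stages. Note that this sketch returns an empty list for empty input, whereas SQL's COUNT(*) without grouping would return a single row with 0.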
The code is alive and well now, so you can check it out and start looking. I would appreciate any commentary you have, and would appreciate patches even more :-)