On Hadoop
Yesterday or the day before, I read the available chapters of Hadoop in Action. Hadoop is a MapReduce implementation in Java, and it includes some very interesting ideas.
The concept of MapReduce isn't new, but I liked seeing actual code examples, which made it much easier to follow what is really going on. As usual with an In Action book, a lot of the material relates to getting things set up, and since I don't usually work in Java, those parts were of little interest to me. But the core ideas are very interesting.
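The core map/shuffle/reduce flow can be sketched with the classic word-count example. This is plain Java meant only to illustrate the idea; a real Hadoop job implements its Mapper/Reducer API and runs distributed across a cluster:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every word in an input line
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // "shuffle" + "reduce": group the pairs by key, then sum each group's values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
            Collectors.groupingBy(Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        // in Hadoop, the mapping happens in parallel on splits of the input
        List<Map.Entry<String, Integer>> mapped = input.stream()
            .flatMap(line -> map(line).stream())
            .collect(Collectors.toList());
        Map<String, Integer> counts = reduce(mapped);
        System.out.println(counts.get("the")); // 2
    }
}
```

The point of the model is that `map` and `reduce` are the only pieces you write; the framework owns splitting the input, shuffling pairs to reducers, and handling machine failures.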
It does seem to be limited to a fairly small set of scenarios, ones that, in essence, require indexing large sets of data. Some of the examples in the book made sense as theoretical problems, but I am still missing the concrete "order to cash" scenario: seeing how we take a business problem and turn it into a set of technical challenges that can be resolved by applying MapReduce to some part of the problem.
As I said, only the first 4 chapters are currently available, and I was reading the early access version, so it is likely this will be addressed as more chapters come in.
Comments
You might want to take a look at DryadLINQ (research.microsoft.com/en-us/projects/DryadLINQ/). It is a framework that extends LINQ to the Dryad distributed execution environment. Basically, you write LINQ queries (including action queries) and they are automatically distributed to a cluster.
Hadoop, to me at least, is more than just a MR implementation.
Hadoop includes a number of useful subsystems, including HDFS (the Hadoop Distributed File System). HDFS is distributed, replicated storage that feeds the splitting/grouping parts of the MR process.
I've been looking at HDFS as a purely low-tech option for long-term document storage. Since all of the documents are identified by a key, quick retrieval is easy, and the data is replicated across cheap machines. Since I could then build access methods on top, using MR to get at the data and filter/query the contents, the infrequent projections of data into some sort of document list/report would be easy to build.
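That storage idea can be sketched as a toy in plain Java: documents addressed by key, each write copied to several "nodes" so any replica can serve a read, and a scan-and-filter "report" on top standing in for an MR query. Every name here is illustrative; none of this is the real HDFS API:

```java
import java.util.*;
import java.util.stream.*;

// Toy key-addressed, replicated document store (illustration only, not HDFS).
public class DocStoreSketch {
    private final List<Map<String, String>> nodes; // each node maps key -> document
    private final int replicas;

    DocStoreSketch(int nodeCount, int replicas) {
        this.nodes = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
        this.replicas = replicas;
    }

    // write the document to `replicas` consecutive nodes, chosen by key hash,
    // so losing a cheap machine loses no data
    void put(String key, String document) {
        int start = Math.floorMod(key.hashCode(), nodes.size());
        for (int i = 0; i < replicas; i++)
            nodes.get((start + i) % nodes.size()).put(key, document);
    }

    // quick retrieval: the key hash points straight at a node holding it
    String get(String key) {
        int start = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(start).get(key);
    }

    // "projection": scan every node, de-duplicate (replicas repeat), and filter,
    // the way an MR job would scan blocks to build an infrequent report
    List<String> report(java.util.function.Predicate<String> filter) {
        return nodes.stream()
                    .flatMap(node -> node.values().stream())
                    .distinct()
                    .filter(filter)
                    .sorted()
                    .collect(Collectors.toList());
    }
}
```

The trade-off the comment describes falls out of the shape: point reads by key are cheap, while anything else is a full scan, which is acceptable only because the list/report projections are infrequent.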
I've been spending more time in Java the past few weeks, and it has been nice to just pull down an OS project and use it instead of constantly thinking "Okay, now this is how they did it in Java, maybe I should port it to .NET"
Mind you, I'm not a convert away from .NET; I just appreciate a thriving ecosystem of Java open source projects that is helping us get things done without a lot of pain.
I think most order to cash scenarios don't involve a cluster doing the processing (though they may be load balanced to some degree), which is why you don't see too many examples like that. The kinds of problems Google has to solve are very different from most business problems. Unless the business scenario involves huge amounts of data that can't be represented in the normal ways, you're unlikely to really need all that, and the standard stuff will work fine.
pb,
My point was, I want to see the reasons for why you would do that.
Not how you do it, but what you are doing.
I haven't played around with Hadoop yet, but it looks like Amazon has added Hadoop as an option in their cloud offerings.
See link: aws.amazon.com/.../announcing-amazon-elastic-ma...