Thinking about Cloud Computing
A few weeks ago I had a fascinating discussion with someone about the uses of cloud computing. Over the course of an hour, we covered quite a lot of very interesting options.
The traditional driver for using cloud computing is the desire to avoid managing your own infrastructure, and the problems one has to face when expanding it. Being able to hand all of that to a 3rd party is a very attractive offer, especially if you feel that this 3rd party is someone you can rely on to do a better job of managing infrastructure than your own developers.
In general, developers are not IT professionals, and it shows. An additional benefit of having a 3rd party manage your infrastructure is that they tend to place limitations on the things that you can do. These limitations are usually a good thing, since they force you to design in a scalable manner.
Amazon's EC2 doesn't give you persistent storage other than S3 and SimpleDB. This means that bringing up a new VM instance is extremely easy, and once you do, you have just increased your capacity. The same goes for Google's App Engine and the limitations it has.
Of course, that is just the common motive. The competitive price of cloud computing vs. the high cost of building your own data center (or hosting your servers in someone else's) is also a reason to move in that direction.
There are other scenarios, equally interesting, however, of using cloud computing.
Consider the case of a spike in traffic: you are running an online shop and it is a major holiday, so you are likely to see a huge surge in traffic. Using your own data center would force you to build to the maximum expected capacity, which is often an order of magnitude higher than your normal capacity. Using cloud computing, you can simply turn on additional instances on an as-needed basis.
Even that benefit has been beaten to death, though. What about this scenario?
You need to test your application, and as usual, it is hard to find a test environment with 30 servers that you can try your latest version on. Setting up something like that on EC2 for a week will set you back less than $400.
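To make the arithmetic concrete, here is a back-of-the-envelope sketch. The hourly rate is the 2008-era EC2 small-instance price, and the run length is my own assumption, not a figure from the post:

```python
# Rough cost of a temporary 30-server test environment on EC2.
# Assumed: $0.10 per instance-hour (the small-instance rate at the time).
HOURLY_RATE = 0.10
SERVERS = 30
HOURS = 24 * 5  # a five-day test run, instances left up around the clock

total = SERVERS * HOURS * HOURLY_RATE
print(f"${total:,.2f}")  # → $360.00
```

Even running all 30 servers non-stop for a full seven days only brings this to roughly $500, still far cheaper than buying the hardware.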
Periodic batch processing is another scenario where you might consider cloud computing. I know of several places where heavy-duty processing (diamond image processing for cutting machines, for example) takes quite a bit of computing power, but it is something that you can batch and do on an "as needed" basis, reducing your infrastructure cost significantly. Payroll is also a fairly intensive process that happens only periodically.
Another use case is to set up cloud machines as clients in a distributed load test. You can arrange for a large number of clients (a thousand machines, which you probably could not assemble on your own) to run against your application.
The ability to easily perform a rolling update is also attractive.
The picture is not so rosy, however, when you consider that cloud computing also has some not-insignificant minuses.
Broadly, there are three types of cloud services.
- Google App Engine - you upload the application, and that is it. You don't even have the concept of a machine in this system. Scalability and distribution are solely the concern of App Engine, not your code. This is currently limited to Python only, and the environment is modified to ensure that you can only do things the Right Way. A lot of the concerns that I intend to list are not relevant for this scenario, because you don't have the concept of a machine. Broadly, I think that this is the way to go, and given the chance to build my own cloud computing application, I would go with a very similar concept.
- Amazon EC2 - you upload a VM image, and you can start creating instances of it. The VM image can contain whatever you want, but you have no persistent storage. That means that you can't actually save data to the local disk or run an RDBMS server. Persistent data is handled using Amazon's web services (S3, SQS, SDB). More on that later.
- GoGrid - You take an existing VM template from their site, customize it, and start running it. This is the closest that you can achieve with regards to data center in the cloud, because GoGrid's systems behave just like real machines. That is, you don't have to be worried about persistent storage and the like. On the one hand, it is very convenient. On the other hand, I am not sure that I like this.
The main fault that I find with Google App Engine at the moment is the limitation to Python. From all other perspectives, it is as close to the model of the ideal cloud service as you can get. All the infrastructure concerns have been stripped away; you only have to deal with the application concerns.
EC2 and GoGrid both allow me to set up a VM and start running it. EC2's no-persistence model means that it is much easier to scale by creating new instances and using the Amazon services to handle storage. GoGrid's model means that I have a lot more flexibility, but with it comes the chance of major issues. In particular, it seems that it is more complex to clone a machine a hundred times there than it is on EC2.
EC2 worries me somewhat, because I am not sure how it handles such things as configuration changes (remember, no persistence: on reboot, all changes to the system are wiped), and how it handles patches and updates to the system itself. Perhaps it is because I work on Windows so often that I worry about how to deal with Patch Tuesday, but even on the Linux images that are common on EC2, there would be a need to perform such updates. In the EC2 model, that would require getting the image, making the change locally, and uploading the image again.
GoGrid, however, will allow me to perform the update in place, but by the same token, I would need to perform this update on all machines in my application.
The EC2 model means that it is very easy to bring up new instances; the GoGrid model means that it is likely to be harder, because instances are not frozen images, they are actual servers, which have state.
Other issues that would concern me in such a scenario are load balancing, including auto-discovery of failed or new instances. From a cursory check, both Amazon and GoGrid have at least rudimentary support for this, but it leaves some things to be desired. In particular, the requirements for load balancing are:
- Distribute loads among a cluster of application servers.
- Handle failover of an application server gracefully.
- Ensure the cluster of servers appears as a single server to the end user.
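As a toy illustration of those three requirements, a minimal round-robin balancer with failover might look like the following sketch. All the names are mine, not any vendor's API:

```python
import itertools

class RoundRobinBalancer:
    """Distribute requests round-robin, skip servers marked as failed,
    and present a single entry point to the caller."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def next_server(self):
        # Try each server at most once per call, skipping failed ones.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server not in self.down:
                return server
        raise RuntimeError("no healthy servers available")
```

The real problem, of course, is not the distribution logic but who runs it, and what happens when that machine fails.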
A short search in the Amazon forums raised several issues regarding load balancing in the EC2 system. It seems the common method is to have an EC2 instance running either HAProxy or round-robin DNS deal with this. Nothing is said about what happens if that instance goes down, but I think that I can guess...
This is partially why I think that the Google App Engine model is preferable. This model explicitly and unabashedly tells you that you have no business managing the infrastructure; that is left for someone else (in this case, Google) to manage. But even if you decide to build such a system on your own, infrastructure concerns should stay well out of the application code.
In other words, you would need to create a provisioning system that looks at the load, creates / destroys instances, updates routing information, etc. Not hard, I think, but most certainly tiresome.
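The core of such a provisioning system is a simple control loop. Here is a sketch; the `cloud` and `router` objects, their methods, and the thresholds are all entirely hypothetical stand-ins for whatever the vendor actually exposes:

```python
def provision(cloud, router, scale_up_at=0.75, scale_down_at=0.25):
    """One tick of a hypothetical provisioning loop: inspect load,
    create or destroy instances, and update routing information."""
    load = cloud.average_load()  # e.g. mean CPU across all instances
    if load > scale_up_at:
        instance = cloud.create_instance()
        router.add(instance)  # start routing traffic to the new box
    elif load < scale_down_at and cloud.instance_count() > 1:
        instance = cloud.pick_instance()
        router.remove(instance)  # drain traffic before destroying
        cloud.destroy_instance(instance)
```

You would run this periodically (or on a monitoring alert), which is exactly the tiresome plumbing that a platform like App Engine takes off your hands.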
Thoughts?
Comments
Oren,
I liked your internal discussion about the merits of the various services. I can answer some of your concerns about GoGrid. But first, just a bit of framing. Both GoGrid and EC2 are what many people (including myself) consider to be Cloud Infrastructure providers, while Google App Engine is more of a Cloud Platform provider. The advantage of choosing an Infrastructure provider is that you have more control (e.g., not being limited to using Python only). EC2 does currently limit you to using Linux (you could do Windows, but it would be through virtualization on top of virtualization).
Some quick things about GoGrid:
- Our load balancers are actually hardware-based F5 load balancers. We developed a method to hook the GoGrid infrastructure into the physical F5 boxes. So your concerns about a true load balancer are pretty much answered.
Cloning - this is a feature that will be coming this year. You will be able to select a particular image and clone instances of it.
API - if you pair our REST-like API with the cloning, you can programmatically script your cloning process which would make it very easy.
Feel free to drop me a note if you have further questions. Liked your post!
-Michael
Technology Evangelist for GoGrid
It's significant that GAE does not cost a dime for 500 MB of disk and approximately 10K unique visitors/mo.
The Python limitation could be beneficial in that it's pushing a lot of developers to learn an interesting language.
Keeping EC2 up-to-date is easy. Just log in once a month, apt-get update, apt-get upgrade, and run a script to burn your new image to S3, done. Also, you can set up your instance to pull a shell script off of S3 at boot. That shell script can do things like install updates, start/stop services, configure the machine, pull more files off of S3, you name it.
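That boot-time pull can be sketched in a few lines. The bucket URL below is a made-up placeholder; any location the instance can reach at boot would do:

```python
import subprocess
import urllib.request

# Hypothetical bucket and key, for illustration only.
BOOTSTRAP_URL = "https://my-bucket.s3.amazonaws.com/bootstrap.sh"

def run_bootstrap(url=BOOTSTRAP_URL):
    """Fetch the boot-time shell script and hand it to a shell.

    The script itself can then install updates, start or stop
    services, pull more files down, and so on.
    """
    with urllib.request.urlopen(url) as response:
        script = response.read()
    subprocess.run(["sh"], input=script, check=True)
```

Hooking something like this into the instance's init sequence is what makes a "blank" EC2 instance configure itself on the way up.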
Pretty simple really. It's a fantastic service. The lack of permanent disk storage is somewhat annoying if you're very database heavy. If you're a smaller shop, it can be costly to run multiple instances so that you have database redundancy, but as your company grows it becomes very compelling and isn't really an issue anymore.
I think that one of the concepts of cloud computing is ignorance of what hardware you are running on. It's just taken care of for you. If your app happens to need more resources, they will be allocated.
GoGrid/EC2 is more of a hosting service with some add-ons (not that there is anything wrong with that..).
EC2 has persistent storage - http://aws.typepad.com/aws/2008/04/block-to-the-fu.html - they've had it since April 08. It works really well.
Also consider taking a look at RightScale's features - http://www.rightscale.com/m/features.html - they add a lot of functionality on top of EC2. A (poor) analogy is .NET and P/Invoke. RightScale is .NET. But you can always use EC2 directly (P/Invoke) to get it just right.
Michael,
Thanks for the correction, infrastructure vs. platform.
Consider also Microsoft BizTalk Services hosted workflows: http://labs.biztalk.net/Workflow.aspx. Its model is somewhat close to Google's.
I wouldn't consider Windows Workflow Foundation to be a good development environment.
Ayende,
Great post. I've been doing some R&D on cloud computing for an application I'm designing and have run through a lot of the same thought processes as you have. In my research, I found a company called GigaSpaces, which provides a middle layer on top of EC2 that allows you to run Java and .NET apps (http://www.gigaspaces.com). They even have a program for startups (under $5M) which allows you to use their services for free.
I really wish MS would get into the game with a true cloud/grid computing offering built on their Hyper-V tech, SQL 2k8, and .NET.
@Ayende
Don't use MS designers, use Boo or whatever you like. BizTalk Services Workflow is just a XOML file. How you create one is up to you.
Thank you, but writing lots of XML is not part of what I consider a productive activity.
"EC2 ...on reboot, all changes to the system are wiped"
Actually, this is incorrect. You can reboot and nothing is lost. Data in the instance disk is lost if the instance is terminated, whether intentionally or not.
Can you define the difference between termination and restart?
You can reboot an instance with a regular Linux command like shutdown -r (through ssh) or using the Amazon tools with ec2-reboot-instances.
And to terminate an instance shutdown -h or ec2-terminate-instances. Once the instance is terminated, it's gone and can't be restarted.
Ariel,
If I add a new instance, it is blank, right?
If I remove an instance, all data is lost, right?
Thanks for the clarification, btw. I think the main issue is still there: in such a system, you can't really rely on the disk.