Amazon Web Services: A Developer Primer, Page 2
SimpleDB and Simple Storage Service
SimpleDB is very reliable (won't lose data) and highly scalable (requiring no effort at all from you), and it provides schema-free data storage. SimpleDB also shares some of the inconveniences of the Google App Engine datastore for developers who are used to using relational databases.
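To make "schema-free" concrete, here is a minimal in-memory sketch of the SimpleDB data model: a domain holds named items, each item is a set of attribute name/value pairs, an attribute may hold several values, and values are stored as strings. The Domain class and its put, get, and select methods are hypothetical names chosen for illustration; this is not the SimpleDB API.

```python
from collections import defaultdict


class Domain:
    """Toy stand-in for a SimpleDB domain (hypothetical, not the real API)."""

    def __init__(self, name):
        self.name = name
        # item name -> {attribute name -> set of string values}
        self.items = {}

    def put(self, item_name, **attributes):
        """Add attributes to an item; no schema is declared anywhere."""
        item = self.items.setdefault(item_name, defaultdict(set))
        for attr, value in attributes.items():
            values = value if isinstance(value, (list, tuple, set)) else [value]
            # Every value is stored as a string, so numbers lose their type.
            item[attr].update(str(v) for v in values)

    def get(self, item_name):
        """Return the item's attributes as {name: sorted list of values}."""
        item = self.items.get(item_name, {})
        return {attr: sorted(vals) for attr, vals in item.items()}

    def select(self, attr, value):
        """Return names of items whose attribute contains the given value."""
        return sorted(
            name
            for name, item in self.items.items()
            if str(value) in item.get(attr, ())
        )


books = Domain("books")
books.put("item1", title="Moby Dick", author="Melville")
# A second item with different attributes, one of them multi-valued.
books.put("item2", title="AWS Notes", tags=["cloud", "aws"], pages=120)
```

Note that the two items share no fixed set of columns, and that pages comes back as the string "120". Differences like these are what developers used to relational databases tend to notice first.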
Amazon's Simple Storage Service (S3) is really the backbone of AWS. It stores data files in containers called "buckets." I've heard people say that the sun would explode sooner than they would lose data in S3. Personally, I believe that the sun is more reliable, but S3 is very robust because it automatically stores redundant copies of your data across multiple facilities. Even for applications that I deploy to dedicated servers, I usually use S3 to store static data like web application file attachments and database backups.
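As an illustration of storing backups in a bucket, here is a small sketch that builds a dated object key and uploads a file. It assumes a client object with an upload_file(Filename, Bucket, Key) method, which is what a boto3 S3 client provides; the function names, the "db-backups" prefix, and the key layout are hypothetical choices, not something prescribed by S3.

```python
from datetime import date
from pathlib import Path


def backup_key(filename, day, prefix="db-backups"):
    """Build an object key such as 'db-backups/2011/03/07/dump.zip'.

    The prefix and date layout are illustrative assumptions.
    """
    return f"{prefix}/{day:%Y/%m/%d}/{Path(filename).name}"


def upload_backup(s3_client, path, bucket, day=None):
    """Upload `path` to `bucket` and return the key it was stored under.

    `s3_client` is assumed to expose upload_file(Filename, Bucket, Key),
    as a boto3 S3 client does.
    """
    key = backup_key(path, day or date.today())
    s3_client.upload_file(str(path), bucket, key)
    return key
```

With boto3 installed and credentials configured, a call might look like upload_backup(boto3.client("s3"), "mydatabase_pg_dump.zip", "my-example-bucket"), where the bucket name is a placeholder. Passing the client in as a parameter keeps the function easy to exercise without touching the network.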
Elastic MapReduce is a service for scaling large calculations that uses S3 to store both input data sets and output results. It is a cost-effective way to start up several large servers for an hour or two to quickly process large data sets. I highly recommend it for processing large amounts of log data, for text and data mining, and for similar batch jobs.
The terms "map" and "reduce" have described operations in functional languages such as Lisp for many decades. Today, "MapReduce" usually refers to Google's algorithm and its implementations, which involve managing a very large shared file system and worker processes that operate on this shared data.
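For readers who have not met the functional-language sense of the words, here is a tiny example using Python's built-in map and functools.reduce. The word list is made-up sample data.

```python
from functools import reduce

words = ["map", "reduce", "lisp", "hadoop"]  # made-up sample data

# map: apply one function to every element independently.
lengths = list(map(len, words))

# reduce: fold a sequence into a single value with a two-argument function.
total = reduce(lambda acc, n: acc + n, lengths, 0)

# reduce also works without numbers: keep the first longest word seen.
longest = reduce(lambda a, b: a if len(a) >= len(b) else b, words)
```

Because each map call is independent of the others, the map step is the part that can be spread across many machines, which is the observation the large-scale systems build on.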
I frequently use the open source Hadoop project, which implements Google's MapReduce ideas. I used to run Hadoop "jobs" on my own home servers but now I almost always use Amazon's Elastic MapReduce service.
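The shape of a MapReduce job can be sketched in a single process: a mapper emits key/value pairs, a shuffle step groups values by key, and a reducer combines each group. The log format ("METHOD PATH STATUS") and the function names below are assumptions for illustration; in Hadoop the framework performs the shuffle and runs mappers and reducers on different machines.

```python
from collections import defaultdict


def mapper(line):
    """Emit (status, 1) for one log line of the assumed form 'METHOD PATH STATUS'."""
    fields = line.split()
    if len(fields) == 3:  # silently skip malformed lines
        yield fields[2], 1


def reducer(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in-process over an iterable of lines."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


logs = ["GET / 200", "GET /missing 404", "POST /login 200", "garbage"]
status_counts = run_job(logs)
```

Counting status codes in web logs is a typical example of the log processing this kind of job is well suited to, since each line can be mapped without reference to any other.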
Relational Database Service
I have never used Amazon's Relational Database Service (RDS) on a customer job, but I did try it the first day it was released as a public service. RDS is an EC2 instance running a managed MySQL server. What makes RDS attractive is that it provides a single MySQL server with virtually no administration effort; Amazon keeps the server running and backs it up periodically for you.
Personally, I prefer not to use RDS because I prefer PostgreSQL (or PostGIS) as a relational database and managing my own database seems more flexible. Note that it is very easy to back up a database on one of your EC2 instances; I use a cron job with an entry that looks like this (all one line, split here for readability):
15 20 * * 1 (cd /home/devaccount; rm -f mydatabase*.txt; rm -f mydatabase*.zip; \
  /usr/bin/pg_dump -U postgres mydatabase > mydatabase_pg_dump.txt; zip -9 -r \
  mydatabase_pg_dump_monday.zip mydatabase_pg_dump.txt; s3cmd put \
  mydatabase_pg_dump_monday.zip s3://marks_dev_db_backups/)
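The same dump, zip, and upload steps could also be written as a short Python script, sketched here under some assumptions: pg_dump lives at /usr/bin/pg_dump and s3cmd is on the PATH, as in the cron entry, and the function names and the run parameter are hypothetical additions that make the steps easy to try without a database.

```python
import subprocess
import zipfile
from pathlib import Path


def dump_and_zip(database, workdir, day_name, run=subprocess.run):
    """Dump `database` with pg_dump, zip the dump, and return the zip path."""
    workdir = Path(workdir)
    dump_path = workdir / f"{database}_pg_dump.txt"
    zip_path = workdir / f"{database}_pg_dump_{day_name}.zip"
    with open(dump_path, "wb") as out:
        # check=True raises if pg_dump exits with a non-zero status.
        run(["/usr/bin/pg_dump", "-U", "postgres", database], stdout=out, check=True)
    # compresslevel=9 mirrors the `zip -9` flag in the cron entry.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
        zf.write(dump_path, arcname=dump_path.name)
    return zip_path


def push_to_s3(zip_path, bucket_url, run=subprocess.run):
    """Copy the archive to S3 by shelling out to `s3cmd put`."""
    run(["s3cmd", "put", str(zip_path), bucket_url], check=True)
```

One difference from the cron one-liner is that check=True stops the script when a step fails, whereas commands chained with semicolons keep going and may upload an empty dump.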
That said, if an RDS instance crashes in the middle of the night, Amazon will restart it for you. If the EC2 instance running your self-managed database server fails, you will either have to restart it manually and recover from the latest backup or, even better, set up automatic failure detection and recovery.
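Automatic detection can start very simply. The sketch below checks whether a TCP port accepts connections (5432 is PostgreSQL's default port) and calls a recovery hook after several consecutive failures. The function names, retry counts, and the recover callback are hypothetical; what recovery actually does, such as launching a new instance and restoring the latest backup, is left to you.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(check, recover, attempts=3, delay=5.0, sleep=time.sleep):
    """Run `check` up to `attempts` times; call `recover` if every try fails.

    Returns True as soon as a check passes, False after recovery is triggered.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before retrying, but not after the last try
    recover()
    return False
```

A monitoring host might run something like watch(lambda: port_is_open("db.example.internal", 5432), restore_database) from cron every few minutes, where the host name and restore_database are placeholders. An open port only shows that something is listening, so a real check would probably also run a trivial query.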
Elastic Load Balancing
Elastic Load Balancing (ELB) is a relatively new Amazon service that I have used on only one customer project. ELB takes the place of a software load balancer such as HAProxy, distributing incoming HTTP requests across multiple servers. While running HAProxy yourself on an EC2 instance that you manage is a good alternative, I think that you might as well simply use ELB and let Amazon manage this infrastructure task for you.
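To show what "distributing requests across multiple servers" means in practice, here is a minimal round-robin sketch that skips servers marked unhealthy. Round robin is one common strategy; this is not a description of how ELB or HAProxy is implemented, and the class and method names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        """Stop routing to a backend, for example after a failed health check."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Resume routing to a known backend."""
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        """Return the next healthy backend, or raise if none are available."""
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")
```

A managed service takes over exactly this bookkeeping, along with the health checks that decide when to call the equivalent of mark_down and mark_up.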
One decision that you need to make is whether you want to partition your application across more than one availability zone. The advantage of using multiple zones is that there is very little chance of your application becoming unavailable if one of Amazon's availability zones goes offline for any reason. The disadvantages are increased system complexity, the cost of the ELB service, and non-local bandwidth charges between availability zones (bandwidth is free inside a single availability zone).
You will have to determine the business cost of potentially having your web application unavailable.
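A rough way to frame that business cost is a back-of-the-envelope calculation like the one sketched below. It assumes that zones fail independently, which is a simplification, and every number you feed it (zone availability, revenue lost per hour of downtime) is your own estimate, not a figure published by Amazon. The function names are hypothetical.

```python
def combined_availability(zone_availability, zones):
    """Availability when any one of `zones` independent zones is enough.

    Assumes zone failures are independent, which real outages may violate.
    """
    return 1 - (1 - zone_availability) ** zones


def expected_downtime_cost(availability, cost_per_hour, hours_per_year=8760):
    """Expected yearly cost of downtime for a given availability estimate."""
    return (1 - availability) * hours_per_year * cost_per_hour
```

For example, with a hypothetical 99 percent availability per zone, combined_availability(0.99, 2) comes out at about 99.99 percent. Comparing expected_downtime_cost for one zone and for two against the extra ELB and bandwidth charges gives a first estimate of whether the added complexity pays for itself.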
Much has been written on the reliability of cloud deployments. For many types of applications, I think that almost 100 percent uptime is not the point. I think the real point of cloud platforms is that they allow developers to drastically reduce the costs (both development time and deployment costs) of developing innovative new web applications that users will enjoy and want to use.
About the author
Mark Watson is a consultant living in the mountains of Central Arizona with his wife Carol and a very feisty Meyer's Parrot. He specializes in web applications, text mining, and artificial intelligence. He is the author of 16 books and writes both a technology blog and an artificial intelligence blog at www.markwatson.com.