I am an enthusiastic user of cloud computing platforms. As a developer, I’d much rather spend my time actually writing web applications and services and leave the scalability and deployment issues to someone (or something) else. One cloud deployment platform in particular that I’ve found to be powerful and flexible is Amazon Web Services (AWS).
I use AWS for several customer projects as well as for personal projects (Amazon also gives me grants so that I can freely experiment with AWS to support my writing endeavors). With Amazon, you have the best of both worlds: scalable services that you can use to architect your applications, and the flexibility to install and run any software on Amazon’s Elastic Compute Cloud (EC2) instances (a “server unit” implemented using virtual private server technology).
Amazon partitions their massive server farms across different geographical locations called “availability zones.” At the cost of increased network latency and having to pay non-local bandwidth fees, you can make your web applications and web services more robust by partitioning them across more than one availability zone.
In this article, I draw on my professional and personal experiences with AWS to provide a primer for developers who are new to the platform.
Management Console Versus Command Line Tools?
When I first started using AWS, I relied on the Web Management Console for most tasks, including starting and stopping EC2 instances, backing up my file volumes with snapshots, assigning Elastic IP addresses to EC2 instances, etc. However, as I started to use AWS on more consulting jobs, clients almost always asked me to automate as many of the routine administration tasks as possible. For that requirement, you should install the AWS command line tools and learn to use them. You also should install the command line tools on your EC2 AMIs (Amazon Machine Images) so that they are available on your EC2 instances.
For tasks such as automatically attaching EBS (Elastic Block Store) volumes and mounting their file systems, or automatically assigning Elastic IP addresses to an EC2 instance, you would write a script in Ruby (or bash, Python, or whatever your favorite scripting language is) to run from /etc/rc.local (assuming Linux, not Windows). Learning the command line tools does require considerable time, however. Fortunately, Amazon’s documentation is very good.
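As a rough illustration of such a boot-time script, here is a minimal Python sketch. It assumes the present-day `aws` command line tool is installed and configured on the instance, which may differ from the tools you actually use. The volume, instance, and address identifiers, the device name, and the mount point are all hypothetical placeholders. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Hypothetical placeholders: substitute your own identifiers.
VOLUME_ID = "vol-0123456789abcdef0"
INSTANCE_ID = "i-0123456789abcdef0"
ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # identifies the Elastic IP
DEVICE = "/dev/sdf"       # the kernel may expose it under another name
MOUNT_POINT = "/data"     # assumed to exist already


def boot_commands(volume_id, instance_id, device, mount_point, allocation_id):
    """Return the boot-time commands, in order, as argument lists."""
    return [
        # Attach the EBS volume to this instance.
        ["aws", "ec2", "attach-volume", "--volume-id", volume_id,
         "--instance-id", instance_id, "--device", device],
        # Block until AWS reports the volume as in use.
        ["aws", "ec2", "wait", "volume-in-use", "--volume-ids", volume_id],
        # Mount the file system that lives on the volume.
        ["mount", device, mount_point],
        # Point the Elastic IP address at this instance.
        ["aws", "ec2", "associate-address", "--instance-id", instance_id,
         "--allocation-id", allocation_id],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it and stop at the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(boot_commands(VOLUME_ID, INSTANCE_ID, DEVICE,
                          MOUNT_POINT, ALLOCATION_ID), dry_run=True)
```

To use something like this for real, you would call the script from /etc/rc.local with `dry_run=False` once the printed commands look right for your setup.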
EC2 Instances: AWS Building Blocks
Amazon bills EC2 instances by the hour and offers many options for the amount of memory and number of virtual CPU cores. On the low end, I bought a three-year EC2 reservation for a “small instance” (1 virtual CPU core with 1.7 GB of memory), which costs $0.03/hour (about $21/month, because I always leave it running). If you do not purchase a reserved instance, a small instance costs $0.085/hour (or about $61/month if you leave it running 24/7). EC2 instances are available with up to 68 GB of memory for large applications.
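The monthly figures follow directly from the hourly rates. A small sketch of the arithmetic, assuming a 30-day month and ignoring any other charges such as bandwidth or storage:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, running 24/7


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost in dollars of running one instance for the given hours."""
    return round(hourly_rate * hours, 2)


reserved = monthly_cost(0.03)    # reserved small instance rate
on_demand = monthly_cost(0.085)  # non-reserved small instance rate
print(f"reserved: ${reserved:.2f}/month, on demand: ${on_demand:.2f}/month")
```

The same function covers short bursts: pass the number of hours you expect an instance to run instead of a full month.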
For occasional needs, I set up a larger EC2 instance using the new EBS boot disk option (more on this later). An EBS-backed instance can be halted and restarted very quickly, and you pay for instance time (billed by the hour) only while it is active. For developers and companies on a budget (and who doesn’t try to save money?), this is a “killer feature” of AWS because it enables them to do things such as inexpensively starting up several large servers for a few hours to test deployment strategies or quickly performing large computations.
The easiest way to use an EC2 instance is to find an existing public AMI with the open source infrastructure software that you need already installed. Last year while writing a book on AWS, I set up an AMI that had all of the software I was writing about installed and configured. It allowed me to provide examples for readers to try without much setup time. (I use only Linux on EC2, but Windows is also available if that is your preferred platform.) As a software developer, think of the opportunity for marketing your web applications as AMIs for delivery to customers.
While EC2 instances are the basic infrastructure “units” when using AWS, Amazon also provides other high-level and scalable services that I will discuss next.
Elastic Block Store
Elastic Block Store (EBS) provides block-level storage volumes that you can format using any type of file system that is appropriate for your application. For increased performance (especially for read operations), I use multiple EBS volumes in a RAID 0 configuration. Why RAID 0? EBS volumes are already very robust because Amazon automatically replicates each one within its availability zone. So, it does not make sense to use error-correcting RAID; I just want better read performance.
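To see why RAID 0 helps read performance, it is useful to look at how striping spreads consecutive blocks across volumes, so that a long sequential read hits every volume in parallel. The following is a conceptual sketch only; the function name and the `stripe_blocks` parameter are illustrative, and real software RAID handles this mapping for you.

```python
def locate_block(logical_block, num_volumes, stripe_blocks=1):
    """Map a logical block number to (volume index, block offset on it).

    Blocks are grouped into stripes of `stripe_blocks` blocks, and the
    stripes are dealt out to the volumes in round-robin order.
    """
    stripe = logical_block // stripe_blocks   # which stripe holds the block
    within = logical_block % stripe_blocks    # position inside that stripe
    volume = stripe % num_volumes             # stripes rotate across volumes
    offset = (stripe // num_volumes) * stripe_blocks + within
    return volume, offset


# Eight consecutive blocks on a four-volume array alternate across volumes.
for block in range(8):
    print(block, locate_block(block, num_volumes=4))
```

Because nothing in this layout is redundant, losing one volume loses the array, which is why periodic snapshots still matter.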
While you may need RAID to get the level of performance your application requires, the easiest way to use EBS volumes is as single file systems. Even though EBS volumes are very durable (i.e., reliable), I still recommend periodically taking “snapshots” for additional backup.
Simple Queue Service
Using robust asynchronous messaging systems is the “secret sauce” that makes building reliable distributed systems possible. On a large scale, I have used reliable messaging on a worldwide nuclear test monitoring system (1980s) and a large-scale telephone credit card fraud detection system (1990s). The issues are different when building multiple-server systems that run in a single data center, but reliable messaging is still the secret to keeping the architecture of complex systems simple.
The basic idea of the Simple Queue Service (SQS) is that you can write structured data to what is effectively a globally accessible queue from which other processes can provisionally remove the data and process it. If a worker process fails to acknowledge success in using a queue item, then that queue item is available for other worker processes.
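The “provisional removal” idea can be illustrated with a small in-memory toy. This is not the SQS API, just a sketch of the behavior described above: a received item is hidden for a timeout, and it reappears for other workers unless the first worker acknowledges it by deleting it. The class name, method names, and the explicit `now` argument are all assumptions made for the example, and always returning the oldest item is a simplification.

```python
import itertools


class ToyQueue:
    """In-memory imitation of a queue with provisional removal."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}          # message id -> body
        self._invisible_until = {}   # message id -> time it reappears
        self._ids = itertools.count(1)

    def send(self, body):
        """Add an item; it is visible to workers immediately."""
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._invisible_until[msg_id] = 0.0
        return msg_id

    def receive(self, now):
        """Hand out the oldest visible item and hide it for the timeout."""
        for msg_id in sorted(self._messages):
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, self._messages[msg_id]
        return None  # nothing visible right now

    def delete(self, msg_id):
        """Acknowledge success: the item is removed for good."""
        self._messages.pop(msg_id, None)
        self._invisible_until.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("resize image 42")
taken = queue.receive(now=0)       # worker A takes the item, then crashes
hidden = queue.receive(now=10)     # still hidden from worker B
retried = queue.receive(now=31)    # timeout passed, worker B gets the item
queue.delete(retried[0])           # worker B acknowledges success
```

Passing the clock in as `now` keeps the example deterministic; a real worker would simply poll the service.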
SQS is very robust. Amazon designed and implemented SQS as the backbone for its transactional processing for fulfilling orders, etc. Use this robustness in your own applications, even small applications that run on a single EC2 instance.
SimpleDB and Simple Storage Service
SimpleDB is very reliable (won’t lose data) and highly scalable (requiring no effort at all from you), and it provides schema-free data storage. SimpleDB also shares some of the inconveniences of the Google App Engine datastore for developers who are used to using relational databases.
Amazon’s Simple Storage Service (S3) is really the backbone of AWS. It is used to store data files in containers called “buckets.” I’ve heard people say that the sun would explode sooner than they would lose data in S3. Personally, I believe that the sun is more reliable, but S3 is very robust because it automatically stores redundant copies of your data across multiple facilities. Even for applications that I deploy to dedicated servers, I usually use S3 to store static data like web application file attachments and database backups.
Elastic MapReduce
Elastic MapReduce is a service for scaling large calculations with the MapReduce technique, using S3 to store both input data sets and output results. It is a cost-effective way to start up several large servers for an hour or two to quickly process large data sets. I highly recommend it for processing large amounts of log data, for text and data mining, etc.
The term “map reduce” has described operations in functional languages such as Lisp for many decades. Today, the term usually refers to Google’s algorithm and implementations that involve very large shared file system management and worker processes that operate on this shared data.
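The functional-language sense of the term can be shown in a few lines of Python: a map step turns each input record into a partial result, and a reduce step merges the partial results. This word count is only a single-process sketch with made-up function names; Hadoop and Elastic MapReduce apply the same shape of computation across many machines.

```python
from functools import reduce


def map_line(line):
    """Map step: count the words in one line of text."""
    counts = {}
    for word in line.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into a new dict."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def word_count(lines):
    """Map every line, then fold the partial counts together."""
    return reduce(merge_counts, map(map_line, lines), {})


print(word_count(["the cloud scales", "the queue scales too"]))
```

Because each line is mapped independently and the merge does not care about order, the work can be split across workers, which is what makes the pattern scale.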
I frequently use the open source Hadoop project, which implements Google’s MapReduce ideas. I used to run Hadoop “jobs” on my own home servers but now I almost always use Amazon’s Elastic MapReduce service.
Relational Database Service
I have never used Amazon’s Relational Database Service (RDS) on a customer job, but I did try it the first day it was released as a public service. RDS is an EC2 instance running a managed MySQL server. What makes RDS attractive is that it provides a single MySQL server with virtually no administration effort; Amazon keeps the server running and backs it up periodically for you.
Personally, I prefer not to use RDS because I prefer PostgreSQL (sometimes with the PostGIS extension) as a relational database and managing my own database seems more flexible. Note that it is very easy to back up a database on one of your EC2 instances; I use a cron job with an entry that looks like this (all on one line):
15 20 * * 1 (cd /home/devaccount; rm -f mydatabase*.txt; rm -f mydatabase*.zip; /usr/bin/pg_dump -U postgres mydatabase > mydatabase_pg_dump.txt; zip -9 mydatabase_pg_dump_monday.zip mydatabase_pg_dump.txt; s3cmd put mydatabase_pg_dump_monday.zip s3://marks_dev_db_backups/)
That said, if an RDS instance crashes in the middle of the night, Amazon will restart it for you. If the EC2 instance running your self-managed database server fails, you will either have to restart it manually and recover from the latest backup or, even better, automatically detect the failure and recover.
Elastic Load Balancing
Elastic Load Balancing (ELB) is a relatively new Amazon service that I have used on only one customer project. ELB takes the place of HAProxy to distribute incoming HTTP requests across multiple servers. While running HAProxy yourself on an EC2 instance that you manage is a good alternative, I think that you might as well simply use ELB and let Amazon manage this infrastructure task for you.
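To make “distribute incoming HTTP requests across multiple servers” concrete, here is a minimal round-robin sketch that skips servers reported as unhealthy. Round robin is an assumption chosen for illustration; I am not claiming it is the algorithm ELB or HAProxy uses in any particular configuration, and the class name and server names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backend servers in rotation, skipping unhealthy ones."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(self._backends)

    def pick(self, is_healthy=lambda backend: True):
        """Return the next healthy backend, trying each one at most once."""
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            if is_healthy(backend):
                return backend
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
print([balancer.pick() for _ in range(4)])
```

A managed service adds the parts that are tedious to run yourself, such as the health checks behind `is_healthy`.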
One decision that you need to make is whether you want to partition your application across more than one availability zone. The advantage of using multiple zones is that there is very little chance of your application becoming unavailable if one of Amazon’s availability zones goes offline for any reason. The disadvantages are increased system complexity, the cost of the ELB service, and non-local bandwidth charges between availability zones (bandwidth is free inside a single availability zone).
You will have to determine the business cost of potentially having your web application unavailable.
Much has been written on the reliability of cloud deployments. For many types of applications, I think that almost 100 percent uptime is not the point. I think the real point of cloud platforms is that they allow developers to drastically reduce the costs (both development time and deployment costs) of developing innovative new web applications that users will enjoy and want to use.
About the author
Mark Watson is a consultant living in the mountains of Central Arizona with his wife Carol and a very feisty Meyers Parrot. He specializes in web applications, text mining, and artificial intelligence. He is the author of 16 books and writes both a technology blog and an artificial intelligence blog at www.markwatson.com.