[Edinburgh-pm] Amazon Virtual Servers

Murray perl at minty.org
Thu Jan 10 05:02:40 PST 2008


I was talking to Wim about this last night, and thought I'd CC y'all
too.  My initial interest was in cheap rsync'able offsite backup storage
($25/month for 100GB).  Wim I think was more interested in the
distributed / grid computing potential.

http://aws.amazon.com/ec2

Rentable machines from $0.10/hour.  Billing is per hour that the machine
is running, not per CPU hour of work done.  So if you start one machine,
walk away and leave it idling for a day, you will still be billed
$2.40.  Their prices don't include VAT, which they will add if you use
an account with a European billing address.

Data on the machine is wiped when you shut the machine down.  However,
you can transfer data to Amazon S3 (resilient storage).  You pay for the
data stored, but not for the bandwidth between EC2 machines and S3.

http://aws.amazon.com/s3
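
For instance, with an S3 command-line client -- s3cmd is one (not part
of Amazon's tools, and its syntax varies a little between versions) --
persisting a file before shutdown is a one-liner.  "mybucket" is a
made-up name here:

    # Hypothetical: push a file into an S3 bucket you've already created.
    s3cmd put backup.tar.gz s3://mybucket/backup.tar.gz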

I spent about 60p and a couple of hours before Christmas playing with
this.  It's really quite impressive.  I found this to be a good guide to
work through:

http://docs.amazonwebservices.com/AWSEC2/2007-08-29/GettingStartedGuide/

Note that Amazon's default images are all RedHat / Fedora; however,
there are many "Community" supported AMIs (Amazon Machine Images) that
include Ubuntu, SuSE etc.  Or you can create your own (either from
scratch or by modifying an existing image) and use that.

It is quite possible you may find an existing public image with the Grid
software you need already set up; I don't really know:

http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=101

What follows is a scheme I worked out that would use the above to
provide 100Gb of offsite, resilient backup storage which you could run
rsync to.

-----

A bit of a fiddle to set up initially, and you require fractionally more
scripting than a single rsync on the crontab.

Amazon EC2:
- $0.10 per clock hour that your "virtual host", aka instance, is
  running.  NOT per CPU hour consumed.
- You get ~140Gb "free" storage on your instance.  Storage only persists
  so long as the instance is running (@ $0.10/hour).
- Bandwidth: $0.10/Gb in, $0.18/Gb out

Amazon S3:
- $0.15 per GB-Month of storage used
- $0.01 per 1000 PUT requests
- $0.01 per 10,000 GET requests.
- Free bandwidth between S3 and EC2.
- Max file size: 5Gb

One can create, start/stop and terminate EC2 instances via command-line
tools, though they require a Java Runtime.  Once an instance is running,
you can ssh to it: you ssh as root, password-less, using a key-pair.

This is all automatable with about a dozen lines of bash, including
getting the domain/ip of your newly created instance.
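
Something like this (a sketch only -- the AMI id and key-pair names are
placeholders, and the awk field positions may vary with the version of
the tools, so check yours):

    #!/bin/bash
    # Sketch: start an instance, wait for it to boot, grab its public
    # DNS name, ssh in, then kill it.  Assumes the ec2-api-tools are
    # installed and EC2_PRIVATE_KEY / EC2_CERT are set in the environment.
    ID=$(ec2-run-instances ami-XXXXXXXX -k gsg-keypair \
         | awk '/^INSTANCE/ {print $2}')

    # Poll until the instance reports "running" and has a DNS name.
    HOST=""
    while [ -z "$HOST" ]; do
        sleep 10
        HOST=$(ec2-describe-instances "$ID" \
               | awk '/^INSTANCE/ && $6 == "running" {print $4}')
    done

    ssh -i id_rsa-gsg-keypair root@"$HOST" uptime

    ec2-terminate-instances "$ID"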

First up, we need a loopback-style filesystem, henceforth called LBFS.
Ideally, we want a "growable" one, which is only as large as is required
for the data it contains - but for now, I'm going to assume a
simple/standard LBFS which is 100Gb in size.

Create LBFS, then use "split" to chop it into 20 individual 5Gb chunks.
Each of these 20 chunks is stored in Amazon S3.
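
A sketch of that, assuming GNU tools and ext3:

    # Create a sparse 100Gb file -- no need to write 100Gb of zeros.
    dd if=/dev/zero of=lbfs.img bs=1M count=0 seek=102400

    # Put a filesystem on it; -F because it's a file, not a block device.
    mke2fs -F -j lbfs.img

    # Chop it into 20 x 5Gb chunks with numeric suffixes
    # (lbfs.chunk.00 .. lbfs.chunk.19) for easy scripting.
    split -b 5120m -d lbfs.img lbfs.chunk.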

Per backup-run (a shell sketch follows the list):

1. Create an EC2 instance.
2. ssh to said instance, and GET 20 LBFS chunks of 5Gb each from S3
   (Free bandwidth, $0.00002 for the 20 GET ops)
3. cat LBFS.5gb.chunks* > LBFS.100gb.file
4. Mount LBFS
5. Run rsync / rdiff-backup etc.
   ($0.10/$0.18 per Gb transferred in/out)
6. Unmount LBFS
7. split -b 5120m LBFS
8. PUT 20 LBFS.5gb.chunks back to S3
   (Free bandwidth, $0.0002 for the 20 PUT ops)
9. Terminate the EC2 instance.
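
Roughly, using $HOST from the earlier start-up snippet for steps 1 and
9, and assuming an AMI with rsync installed plus the s3cmd client (not
one of Amazon's tools; its syntax varies a little between versions).
Bucket and path names are made up:

    # --- on the instance (steps 2-4), via ssh ---
    ssh -i id_rsa-gsg-keypair root@"$HOST" <<'EOF'
    set -e
    # Step 2: fetch the 20 chunks, in parallel for throughput.
    for i in $(seq -w 0 19); do
        s3cmd get s3://mybucket/lbfs.chunk.$i lbfs.chunk.$i &
    done
    wait
    # Steps 3-4: reassemble, drop the chunks to free space, and mount.
    cat lbfs.chunk.* > lbfs.img && rm lbfs.chunk.*
    mkdir -p /backup
    mount -o loop lbfs.img /backup
    EOF

    # Step 5: rsync from your own machine into the mounted filesystem.
    rsync -az --delete -e "ssh -i id_rsa-gsg-keypair" \
        /home/ root@"$HOST":/backup/home/

    # --- on the instance again (steps 6-8) ---
    ssh -i id_rsa-gsg-keypair root@"$HOST" <<'EOF'
    set -e
    umount /backup
    split -b 5120m -d lbfs.img lbfs.chunk.
    for f in lbfs.chunk.*; do
        s3cmd put $f s3://mybucket/$f
    done
    EOF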

In addition to bandwidth charges, you have the S3 storage and $0.10 per
(wall clock) hour that your EC2 instance is running to complete this
task.

1 hour @ 40Kb/sec upload on ADSL =~ 140Mb.  Let's assume that is a
decent daily average.  Conveniently, 140Mb/day == 100Gb every 2 years.

Fetching/putting the 100Gb LBFS.  Say 5Mb/sec =~ 6 hours each way?  This
is "internal" bandwidth, free, between Amazon machines.

5Mb/s is the one untested part of my theory, but some googling suggests
it's not unreasonable, and by parallelising the 20 GETs we may improve
on this.  Anyway...

Running the above:

  $0.60 : 6 hours fetching 20*5Gb LBFS chunks (EC2 hourly charge)
  $0.70 : 7 hours rsyncing (EC2 hourly charge)
  $0.40 : worst-case bandwidth (7*140Mb =~ 1Gb up & 1Gb down)
  $0.60 : 6 hours putting the LBFS chunks back (EC2 hourly charge)

Assume we ran this once per week: ($2.30 * 4) = $9.20, plus ~$15/month
for S3 storing 100Gb.

~ $25/month for 100Gb redundant, offsite storage backed up weekly via
rsync, plus some additional command line shenanigans that you only have
to write once.  I estimate < 100 lines of bash.

If you left your EC2 instance running 24/7, it would be ~$75/month, which
is more than Bytemark, but then you get 140Gb of disk space included and
1.7Gb of RAM.  Afaik, your IP address remains static as long as the
instance remains running, but they don't make promises about uptime.

The equivalent $25 on rsync.net gets you 15Gb, but more frequent backups
if you need them.

At 50Gb instead of 100Gb, it's ~$15/month all-in on Amazon.

