Excerpt

Three Cornell-based research-specific cloud case studies, from July 2014 through April 2016.

Table of Contents

  • Case study 1, July 2014
  • Case study 2, March 2015
  • Case study 3, April 2016
  • See also

Case study 1, July 2014

Cornell researcher, info from July 2014.

Brevity is not my strong point, but let me give it a go -

...

Let me know if you want any more details, or if you want to chat about it further in person, or talk about what Cornell is brokering. I think it is an excellent idea for Cornell to broker a deal with Amazon.

Case study 2, March 2015

Cornell researcher, info from March 2015.

I am running Bayesian models on quantitative genetic data. Typically, when dealing with real data sets, one workstation is enough. However, when I am analyzing multiple data sets (for example, currently 100 simulated data sets to be analyzed with a number of different models), I need to monopolize the resources of multiple multi-core servers for 12 to 36 hours. In these kinds of cases, I use AWS.

I have a disk image with my data stored in AWS (around 30G, at a cost of ~$0.50/month). I request c3.8xlarge or cc2.8xlarge compute-optimized instances on the spot market, using the Amazon Linux AMI. This keeps the costs down (I wait for the spot price to fall below 30 cents/hour, which happens most of the time), at the expense of getting the instance shut off if the spot price goes above my bid price. I bid at the official price + 1 cent, which seems to be good enough; I have only lost a couple of instances over the last year and a half.

When I run my instances, I activate them using the web interface, make a data disk for each instance from my data disk image (EBS volumes cannot be attached to more than one instance), and run a shell script in the instance (the shell script runs in the background). The shell script reads the data and unmounts the data disk, so I delete the data disks soon after the jobs start, again to save on cost. Several simulated data sets are processed at a time. Once a batch is done, the results are exported (by piping tar output to ssh, which seems to be the most stable solution) to a "lab" server. This is so that if an instance is lost, not all results have to be re-generated.

Data transfer costs are about 1/3 of the compute costs (1/4 of the total). I do not use more than 10 instances at a time (20 is the AWS limit) because it seems like running more puts too much pressure on the "lab" servers. The amount of data transferred is on the order of 20G per instance (for some analyses it's only 2G), in something like 15 installments. Once all the analyses are done, my script shuts down the instance, which then terminates. This way I only pay for the exact time I need.
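
The researcher drives all of this through the AWS web console, but the bidding step can also be scripted. Below is a minimal sketch in Python using boto3, which is not something the researcher mentions; the region, AMI ID, and key pair name are placeholders, and "official price + 1 cent" is read here as the most recently posted spot price plus one cent. The sketch waits for the c3.8xlarge spot price to drop below 30 cents/hour and then submits a spot request.

import time
import boto3

REGION = "us-east-1"           # placeholder region
AMI_ID = "ami-xxxxxxxx"        # placeholder for the researcher's saved Amazon Linux image
INSTANCE_TYPE = "c3.8xlarge"   # compute-optimized instance type used in the case study
PRICE_CEILING = 0.30           # only request once the spot price is below this ($/hour)

ec2 = boto3.client("ec2", region_name=REGION)

def current_spot_price():
    """Return the most recently posted Linux/UNIX spot price for the instance type."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=[INSTANCE_TYPE],
        ProductDescriptions=["Linux/UNIX"],
    )["SpotPriceHistory"]
    # Prices are reported per availability zone; take the newest entry overall.
    latest = max(history, key=lambda h: h["Timestamp"])
    return float(latest["SpotPrice"])

# Wait for the market to dip below the ceiling, then bid one cent above it.
price = current_spot_price()
while price >= PRICE_CEILING:
    time.sleep(600)            # check again in ten minutes
    price = current_spot_price()

ec2.request_spot_instances(
    SpotPrice=f"{price + 0.01:.3f}",
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": AMI_ID,
        "InstanceType": INSTANCE_TYPE,
        "KeyName": "lab-keypair",   # placeholder key pair name
    },
)

The rest of the workflow described above (creating a data volume for each instance, running the analysis script in the background, streaming each finished batch to the lab server by piping tar output to ssh, and shutting the instance down when everything is done) runs inside the instance itself.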

 

Case study 3, April 2016

Cornell Lab of Ornithology, info from April 2016.

At the Cornell Lab of Ornithology, we are using data collected by citizen scientists to understand where bird populations are and how they move during migrations. The goal of this analysis is to produce high-resolution (3 km) estimates of the spatial and temporal abundance of species populations at weekly intervals across North America. A visualization of the estimated distribution for Tree Swallow can be seen at http://ebird.org/content/ebird/wp-content/uploads/sites/55/TRES_1s.gif. The information we extract from this data is being used to identify the environmental drivers that shape species' distributions and for developing science-based management policies.

...