Excerpt |
---|
Three Cornell-based, research-specific cloud case studies, from July 2014, from a Cornell researcher. |
...
through April 2016. |
Case study 1, July 2014
Cornell researcher, info from July 2014.
Brevity is not my strong point, but let me give it a go -
...
Let me know if you want any more details, or if you want to chat in person about it further or about what Cornell is brokering. I think it is an excellent idea for Cornell to broker a deal with Amazon.
Case study 2, March 2015
Cornell researcher, info from March 2015.
I am running Bayesian models on quantitative genetic data. Typically, when dealing with real data sets, one workstation is enough. However, when I am analyzing multiple data sets (for example, currently 100 simulated data sets to be analyzed with a number of different models), I need to monopolize the resources of multiple multi-core servers for 12 to 36 hours. In these kinds of cases, I use AWS.
I have a disk image with my data stored in AWS (around 30G, at a cost of ~$0.50/month). I request c3.8xlarge or cc2.8xlarge compute-optimized instances on the spot market, running the Amazon Linux AMI. Using the spot market keeps the costs down (I wait for the spot price to fall below 30 cents/hour, which happens most of the time), at the expense of getting the instance shut off if the spot price goes above my bid price. I bid at the official price + 1 cent, which seems to be good enough. I have only lost a couple of instances over the last year and a half.
When I run my instances, I activate them using the web interface, make a data disk for each instance from my data disk image (EBS volumes cannot be attached to more than one instance), and run a shell script in the instance (the shell script runs in the background). The shell script reads the data and unmounts the data disk, so I can delete the data disks soon after the jobs start, again to save on cost. Several simulated data sets are processed at a time. Once a batch is done, the results are exported to a "lab" server by piping tar output to ssh, which seems to be the most stable solution. This is so that if an instance is lost, not all results have to be re-generated.
Data transfer costs are about 1/3 of the compute costs (1/4 of the total). I do not use more than 10 instances at a time (20 is the AWS limit) because running more seems to put too much pressure on the "lab" servers. The amount of data transferred is on the order of 20G per instance (for some analyses it is only 2G), in something like 15 installments.
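
The two cost ratios quoted above agree with each other, as a short check shows. This is only a sketch: it assumes the total is compute plus data transfer, ignoring the small monthly storage charge, and the function name is made up for illustration.

```python
from fractions import Fraction


def transfer_share_of_total(transfer_to_compute):
    """Return transfer cost as a fraction of (compute + transfer) cost."""
    # total = compute * (1 + r), so transfer / total = r / (1 + r)
    return transfer_to_compute / (1 + transfer_to_compute)


# Transfer costs are about 1/3 of compute costs in this case study.
share = transfer_share_of_total(Fraction(1, 3))
print(share)  # 1/4
```
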
Once all the analyses are done, my script shuts down the instance, which then terminates. This way I only pay for the exact time I need.
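
The "tar piped to ssh" export step could look roughly like the sketch below. The researcher's actual shell script is not shown, so everything here is an assumption: the host name `lab.example.org`, the directory paths, and both function names are hypothetical, and the archive is sent uncompressed.

```python
import os
import shlex
import subprocess


def build_export_commands(batch_dir, remote_host, remote_dir):
    """Return (tar_argv, ssh_argv) for streaming one batch directory to a remote host."""
    parent, name = os.path.split(os.path.abspath(batch_dir))
    # Local side: write an archive of the batch directory to stdout.
    tar_argv = ["tar", "-C", parent, "-cf", "-", name]
    # Remote side: unpack the archive arriving on stdin into remote_dir.
    remote_cmd = "tar -xf - -C " + shlex.quote(remote_dir)
    ssh_argv = ["ssh", remote_host, remote_cmd]
    return tar_argv, ssh_argv


def export_batch(batch_dir, remote_host, remote_dir):
    """Pipe tar output into ssh and raise if either process fails."""
    tar_argv, ssh_argv = build_export_commands(batch_dir, remote_host, remote_dir)
    tar = subprocess.Popen(tar_argv, stdout=subprocess.PIPE)
    ssh = subprocess.Popen(ssh_argv, stdin=tar.stdout)
    tar.stdout.close()  # so tar gets SIGPIPE if ssh exits early
    ssh_rc = ssh.wait()
    tar_rc = tar.wait()
    if tar_rc != 0 or ssh_rc != 0:
        raise RuntimeError(f"export failed (tar={tar_rc}, ssh={ssh_rc})")
```

Calling `export_batch("/tmp/results/batch_03", "lab.example.org", "/data/incoming")` after each batch would mean that losing a spot instance costs only the batch in progress.
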
Case study 3, April 2016
Cornell Lab of Ornithology, info from April 2016.
At the Cornell Lab of Ornithology, we are using data collected by citizen scientists to understand where bird populations are and how they move during migrations. The goal of this analysis is to produce high-resolution (3 km) estimates of the spatial and temporal abundance of species populations at weekly intervals across North America. A visualization of the estimated distribution for Tree Swallow can be seen here [http://ebird.org/content/ebird/wp-content/uploads/sites/55/TRES_1s.gif]. The information we extract from this data is being used to identify the environmental drivers that shape species’ distributions and for developing science-based management policies.
The observational data for this analysis come from eBird.org, a project run by the Cornell Lab of Ornithology that engages volunteers via the Internet and mobile apps to collect bird observations. We use statistical and machine learning models combining the eBird observations with remote sensing data from NASA to correct for biases in the data and fill in spatiotemporal gaps. Then we use the models to produce a comprehensive set of high-resolution estimates and summaries about the spatial and temporal abundance of species populations. Each job estimates a single species’ population distribution across North America.
This is a CPU-intensive modeling task (>5000 CPU hrs per job) with relatively small input data (a couple of GB) and larger, but still relatively small, output data (~20-50 GB per job). Our workflow is a multistep implementation of map-reduce using Streaming Hadoop 2.6 with some Oozie and AWK scripts to automate the workflow. The statistical analysis is carried out in R, the statistical computing language. Using the cloud to run this workflow has allowed us to drastically reduce the wall-clock time to complete jobs. We have run this workflow on Amazon AWS/EMR using the Spot market (with Spot Fleets) with an r3.4xlarge head node and (up to) 230 m3.2xlarge worker nodes. Currently, we are using Microsoft Azure HDInsight (Linux Hadoop clusters), with clusters of D14 head nodes and 150 D4 worker nodes (1200 cores). The workflow takes 3-5 hours per job (species) on Azure HDInsight.
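
For readers new to Hadoop Streaming, its contract is simple: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the framework sorts by key in between. The sketch below simulates that contract in one process. It is an illustrative stand-in, not the Lab's workflow (their analysis runs in R); the record layout (cell id, week, count) and the averaging step are invented for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>value' records; the key is a hypothetical cell:week pair."""
    for line in lines:
        cell_id, week, count = line.rstrip("\n").split("\t")
        yield f"{cell_id}:{week}\t{count}"


def reducer(sorted_lines):
    """Average the values for each key; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield f"{key}\t{sum(values) / len(values):.2f}"


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce without a cluster."""
    return list(reducer(sorted(mapper(lines))))


# Made-up input records: cell id, week number, observed count.
records = [
    "cell_a\t17\t4",
    "cell_b\t17\t1",
    "cell_a\t17\t6",
]
print(run_locally(records))
```

On a real cluster the mapper and reducer would be separate scripts reading standard input, and a multistep workflow would chain several such map-reduce passes.
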
N.B. Pricing information is not included because pricing has been changing so much. Amazon's Spot Market is the classic example. But it is more than the market price: Amazon continues to include new ways of bidding on the market. Spot Fleets made it possible to acquire larger blocks of processors, and there is another option where you can pay a little more on the spot market and ensure that you keep your processors for at least 6 hours (I forget what this is called!). Azure prices have changed too. For us, the critical pieces of information have been run time and resource requirements, and the demonstration that we can utilize these cloud resources at the necessary scale. For example, it was useful for my group to demonstrate that we could get enough processors for a long enough time period to run our workflow using the spot market.
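
As a rough consistency check on the run-time and resource figures quoted above, the CPU-hour budget can be divided by the core count. This is a back-of-envelope sketch only: it assumes perfect parallel efficiency and that the ">5000 CPU hrs" figure was measured on cores comparable to the Azure ones, neither of which the source states.

```python
def ideal_wall_clock_hours(cpu_hours, cores):
    """Lower bound on wall-clock time if the work parallelizes perfectly."""
    return cpu_hours / cores


# 150 worker nodes totaling 1200 cores implies 8 cores per worker.
workers, cores_per_worker = 150, 8
total_cores = workers * cores_per_worker

estimate = ideal_wall_clock_hours(5000, total_cores)
print(total_cores)         # 1200
print(round(estimate, 2))  # 4.17
```

The result of about 4.2 hours falls inside the reported 3-5 hours per job.
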