
NSF grant awarded. Thus, this is a "go" as of August 2013.

Notes for Hoffmann cluster researchers

Recap:

During this recent test period, researchers confirmed that the new system's hardware and base configuration fully support all the software they currently depend on within Helios. We discovered important things that we could not have learned without this testing investment, which in turn will contribute to a more robust setup for your researchers.

Upcoming to-dos for your researchers, as of 11/27/13:

1) Ensure researchers' user data on Sol can be completely deleted.

  • We will be deleting all the home directories on Sol. (They were originally copied there as a convenience during our test phase.)
  • We do this to enable a clean import of researchers' current production home directories from Helios.
  • Lulu advised researchers of this before the Thanksgiving break, so hopefully there will be no surprises.
  • Bottom line: If there are any files on Sol which researchers cannot afford to lose (hopefully none, since Sol was in test mode and is not being backed up!), researchers must move that data to Helios right away (deadline?).

2) After a predetermined date, researchers will not have access to Sol until it is again ready for testing.

  • Sol will need to be turned off and worked on by us (approximate duration?). Therefore, it will be off-line to researchers.
  • Ad hoc permission will be granted to select researchers (Prasad?) so they may install and test their software.
  • Tasks for us to do:
    • Get the server's versioning hard disk recognized by the new head node.
    • Lulu to install the Intel compiler in the /software/ directory, with the proper licensing.
    • Your group (Prasad?) to install the Materials Studio software in the /data/Hoffmann/bin/ directory. If Prasad is not available to do this, let's discuss our options. Thank you.

3) Researchers will be asked to test select software on Sol.

  • Changes to where the Intel compiler and Materials Studio reside (to facilitate debugging and support) must be confirmed to work for researchers.
  • Again, we are testing on Sol. Thus, all production work should still be on Helios, not Sol.

Oliver's quick notes:

11/15/13, from email sent to Roald, by Oliver:

  • We are still working to learn which Intel package components your group needs or wants, and whether an older version might meet their needs, in case we don't need to invest in the most recent version.

The rest of the email:

I believe we have a short-term answer to address your group's need for the Intel compiler. This temporary solution will hold as long as we are earnestly working on a long-term one.

When you return, I would appreciate discussing options to meet your long-term needs cost-effectively. Michael Hint and I have some imaginative ideas, but some involve coordinating with other CCB researchers (Scheraga and Ananth).

Michael Hint learned the difference between a single-user license and the two-person concurrent license. The answer (below) informs one of our ideas, if you care to read it before we meet. But no problem if you don't get to it, since I can summarize when I discuss our ideas.

==================================

Sent: Friday, November 08, 2013 4:00 PM
To: Guoying Gao
Cc: Oliver B. Habicht
Subject: one computer matter which will come up

Hi, Guoying,

In his conversation with me, Oliver said that he has investigated the error message referring to the absence of a license for an Intel compiler. It indeed appears that we had been using an older compiler, and that to use it in the future we will need to get a license. But that license is costly; Oliver is determining how much. What he will ask you, and what you might inquire of the group in preparation, is how often we use that compiler – daily, once a week, once a year? – and how many people in the group are using it.

Thanks,

roald

==================================

Next steps

  • Meet to review all options and confirm desired direction and expected timing.
  • Review resources. Huayun Gen has cluster management experience, including set-up.

Draft idea

  • Create a stand-alone cluster using new hardware ($25K for a minimum of 3 years of operation (to confirm!), i.e. ~$8K/yr in hardware).
    • Uses new OS and related cluster management software.
    • Install and configure necessary applications.
    • Enable NetID-based access, if possible (limit 2-3 days for a "go/no-go" decision on this functionality)
  • Confirm old nodes can successfully be added to that new cluster.
  • Migrate users and data to new cluster.
  • Migrate old nodes to new cluster.

Unknowns

  • Time to install all necessary applications, many of which are new to Lulu, then configure, verify, and debug new-installation-related issues.
  • Whether NetID-based access will succeed. Note that this is not a do-or-die step, so we will limit the duration of our investigation, with the hope that we can make it happen.

Tasks and estimated timing

| Top Level Task Description | Effort Est. | Assignee |
| --- | --- | --- |
| Planning | | |
| Discovery/Overview mtg | 1.5 hrs | |
| Vet options and conduct needs analysis to match to hardware order | 1-2 weeks | |
| Specify exactly the systems to order within budget. Includes iterating with vendor experts. | 1 week | |
| Approval | 0 days | |
| Order & Installation | | |
| Place & process order | 1/2 week | |
| Delivery, after order is placed at Cornell | ~3 weeks | |
| Receive order and set up hardware in 248 Baker Lab | 1 week | |
| Build New Cluster | | |
| Get head node and 1st cluster node operational with OS and cluster management software | 3 weeks | |
| Test / Verify / Approval | 1 week | |
| Convert Old Cluster | | |
| Move user accounts and data; test, prep, and do | 1 week | |
| Move old nodes to new cluster | 1 week | |

  • Lulu becomes available ~mid-September or early October, as of 8/21/13.
  • See unknowns, above, which relate to tasks that will obviously take additional time to accomplish.

Other provisioning models and related ideas

  • We can walk through rates and scenarios, as appropriate.
  • We can meet with CAC since they may be willing to do more with a commitment of $25K than is published with their $400 min. offering.
    • Brainstorming idea: Would they be willing to add hardware to CAC's RedCloud to get a buyer of that hardware a better cost and/or privileged access?

Buy cycles, on demand

Good for irregular high-performance demands, especially if you have high peaks of need and long-lasting jobs.

  • Buy cycles from CAC (RedCloud, minimum of $400 for 8,585 core*hours)
    • http://www.cac.cornell.edu/RedCloud/start.aspx
    • 12 cores available at any one time on one system.
      • Can access more than one system at a time, but systems are not linked.
    • $400 (minimum) buys you 8,585 core*hours
      • This comes out to ~1 core for an entire year, non-stop.
    • For 96 cores, that's $38.4K for 1 year, non-stop. (They have a max of 96 cores <http://www.it.cornell.edu/about/projects/virtual/test.cfm>.)
      • 96 = 8 nodes, each with dual 6-core procs => 8 * 12 = 96
    • Or, for $25K, that's ~536,562 core*hours.
      • $25K = $400 * 62.5 units, and each unit is 8,585 core*hours, so 62.5 units gets you 536,562.5 core*hours.
      • That comes to ~178,854 core*hours/yr over 3 years, equivalent to a ~20.8-core system running non-stop each year. (Compare to one hardware node, which has 12 cores.)
  • CNF, w/ Derek Stuart.
    • A very reliable cluster, per Roald.
  • Determine costs, processes, and trade-offs if using another cloud service, such as:
    • Amazon. Amazon EC2?
    • Google. Google Compute Engine?
    • Microsoft. Microsoft Azure?
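The RedCloud arithmetic above can be double-checked with a few lines of Python. The $400 unit price and 8,585 core*hours per unit are CAC's quoted figures; treating one unit as roughly one core-year is the same simplification the bullets use:

```python
# RedCloud cost arithmetic, using CAC's quoted figures.
UNIT_PRICE = 400          # dollars per RedCloud subscription unit
UNIT_CORE_HOURS = 8_585   # core*hours included per unit (~1 core-year)

# 96 cores non-stop for a year, treating one unit as one core-year:
cost_96_cores_year = 96 * UNIT_PRICE                 # $38,400 (~$38.4K)

# What a $25K budget buys over 3 years:
units = 25_000 / UNIT_PRICE                          # 62.5 units
total_core_hours = units * UNIT_CORE_HOURS           # 536,562.5 core*hours
per_year = total_core_hours / 3                      # ~178,854 core*hours/yr
equivalent_cores = per_year / UNIT_CORE_HOURS        # ~20.8 cores, non-stop

print(cost_96_cores_year, total_core_hours, round(per_year), round(equivalent_cores, 1))
```

Note that the ~20.8-core equivalent uses the 8,585-hour unit-year; against a true 8,760-hour calendar year the figure would be slightly lower.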

Host hardware at CAC rather than with ChemIT

CAC's hosting fee covers the basics: expert initial configuration, then keeping the system current and keeping the lights on. Other services are charged hourly.

Per CAC's rate calculator, the rate for 9 nodes (1 head node + 8 compute nodes) would be $8,291/yr, or $24,873 for 3 years of this service.

At current ChemIT rates, 9 nodes would be $321.84/yr. Or, $965.52 for 3 years of service.

  • ChemIT rates are set by the CCB Computing Cmt and may change at any time. The rate for a group's single system (whether in a cluster or not) is $2.98/month, or $35.76/yr.
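As a sanity check on the hosting figures above (CAC's quoted $8,291/yr for 9 hosted nodes versus ChemIT's $2.98/month per system), the 3-year totals work out as follows:

```python
# Hosting cost comparison for 9 nodes (1 head node + 8 compute nodes).
NODES = 9

# CAC hosting, quoted at $8,291/yr for 9 nodes:
cac_3_years = 8_291 * 3                          # $24,873

# ChemIT, at $2.98/month per system (rate set by the CCB Computing Cmt):
chemit_per_year = round(2.98 * 12 * NODES, 2)    # $321.84/yr
chemit_3_years = round(chemit_per_year * 3, 2)   # $965.52

print(cac_3_years, chemit_per_year, chemit_3_years)
```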

Table, related to our options

| Consideration | ChemIT | CAC: RedCloud | CAC: Hosting | Amazon (EC2?) or Google (Compute?) | Other ideas? |
| --- | --- | --- | --- | --- | --- |
| Hardware costs | $25K | - | $25K | - | |
| Hardware support | Yes | - | Yes | - | |
| OS install and configuration | Yes. CentOS 6.4 | | Yes. CentOS 6.4 | | |
| Cluster and queuing management | Yes. Warewulf, with options | - | Yes. ROCKS, no options | - | |
| Research software install and configuration | Yes | No | Yes; additional cost | No | |
| Application debugging and optimization support | Not usually. Available from CAC, at additional cost? | Yes; additional cost | Yes; additional cost | No. Available from CAC, at additional cost? | |
