NSF grant awarded. Thus, this is a "go" as of August 2013.
Notes for Hoffmann cluster researchers
Upcoming to do's for your researchers, as of 11/27/13:
1) Ensure researchers' user data on Sol can be completely deleted.
- We will be deleting all the home directories. (They were originally copied there as a testing convenience during our test phase.)
- We do this to enable a clean import of researchers' current production home directories from Helios.
- Lulu advised them of this before the Thanksgiving break, so hopefully there are no surprises.
- Bottom line: If there are any files on Sol which researchers cannot afford to lose (hopefully not, since it was in test mode and is not being backed up!), researchers must move that data to Helios right away (deadline?).
2) After a predetermined date, researchers will not have access to Sol until it is again ready for testing.
- Sol will need to be turned off and worked on by us. (Approximate duration?) Therefore, it will be off-line to researchers.
- Ad hoc permission will be granted to select researchers (Prasad?) so they may install and test their software.
- Tasks for us to do:
- Get the server's versioning hard disk recognized within the new head node.
- Lulu to install the Intel compiler in the /software/ directory, with the proper licensing.
- Your group (Prasad?) to install the Materials Studio software in the /data/Hoffmann/bin/ directory. If Prasad is not available to do this, let's discuss our options. Thank you.
3) Researchers will be asked to test select software on Sol.
- Changes to where the Intel compiler and Materials Studio reside (to facilitate debugging and support) must be confirmed to work for researchers.
- Again we are testing on Sol. Thus, the only production work should still be on Helios, not Sol.
Oliver's quick notes:
11/15/13, from email sent to Roald, by Oliver:
- We are still working to learn how many of the Intel package components your group needs or wants, and to what degree a non-current version might meet their needs, in case we don't need to invest in the most recent version.
The rest of the email:
I believe we have a short-term answer to address your group's need for the Intel compiler. This temporary solution will work as long as we are earnestly working on a long-term solution.
When you return, I would appreciate discussing options to meet your long-term needs cost-effectively. Michael Hint and I have some imaginative ideas, but some involve coordinating with other CCB researchers (Scheraga and Ananth).
Michael Hint learned the difference between a single user license and the 2-person concurrent licenses. The answer (below) informs one of our ideas, if you care to read it before we meet. But no problem if you don't get to it since I can summarize when I discuss our ideas.
==================================
Sent: Friday, November 08, 2013 4:00 PM
To: Guoying Gao
Cc: Oliver B. Habicht
Subject: one computer matter which will come up
Hi, Guoying,
In his conversation with me, Oliver said that he has investigated the error message referring to absence of a license for an Intel Compiler. It indeed appears that we had been using an older compiler, and that to use it in the future we will need to get a license. But that license is costly; Oliver is determining how much. What he will ask you, and you might inquire of the group in preparation is how often we use that compiler – daily, once a week, once a year? And how many people in the group are using it.
Thanks,
roald
==================================
Next steps
- Meet to review all options and confirm desired direction and expected timing.
- Review resources. Huayun Gen has cluster management experience, including set-up.
Draft idea
- Create a stand-alone cluster using new hardware ($25K for a minimum of 3 years of operations (to confirm!); thus, ~$8K/yr in hardware).
- Uses new OS and related cluster management software.
- Install and configure necessary applications.
- Enable NetID-based access, if possible (limit 2-3 days for a "go/no-go" decision on this functionality)
- Confirm old nodes can successfully be added to that new cluster.
- Migrate users and data to new cluster.
- Migrate old nodes to new cluster.
Unknowns
- Time for install of all necessary applications, many of which are new to Lulu. Then configure, verify, and de-bug new-installation-related issues.
- Whether NetID-based access will succeed. But note that this is not a do-or-die step, thus we will limit the duration of our investigation, with the hope that we can make this happen.
Tasks and estimated timing
Top Level Task Description | Effort Est. | Assignee |
---|---|---|
Planning | | |
Discovery/ Overview mtg | 1.5 hrs | |
Vet options and conduct needs analysis to match to hardware order | 1-2 weeks | |
Specify exactly the systems to order within budget. Includes iterating with vendor experts. | 1 week | |
Approval | 0 days | |
Order & Installation | | |
Place & Process order | 1/2 week | |
Delivery, after order is placed at Cornell | ~3 weeks | |
Receive order and set-up hardware in 248 Baker Lab | 1 week | |
Build New Cluster | | |
Get head node and 1st cluster node operational with OS and cluster management software | 3 weeks | |
Test / Verify / Approval | 1 week | |
Convert Old Cluster | | |
Move user accounts and data; test, prep, and do | 1 week | |
Move old nodes to new cluster | 1 week | |
- Lulu becomes available ~mid-September or early Oct, as of 8/21/13.
- See unknowns, above, which relate to tasks that will obviously take additional time to accomplish.
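As a rough cross-check, the effort estimates in the table above can be totaled. This is only a sketch: it assumes the tasks run strictly one after another with no overlap, treats the 1.5-hour overview meeting and the 0-day approval as negligible, and does not include the extra time flagged under the unknowns. The names `TASKS` and `total_weeks` are made up for this illustration.

```python
# (task, low estimate in weeks, high estimate in weeks), copied from the table.
# Assumption: tasks are sequential and do not overlap.
TASKS = [
    ("Discovery/overview mtg (1.5 hrs, treated as negligible)", 0, 0),
    ("Vet options and conduct needs analysis", 1, 2),
    ("Specify systems to order", 1, 1),
    ("Approval", 0, 0),
    ("Place and process order", 0.5, 0.5),
    ("Delivery", 3, 3),
    ("Receive order and set up hardware", 1, 1),
    ("Head node and 1st cluster node operational", 3, 3),
    ("Test / verify / approval", 1, 1),
    ("Move user accounts and data", 1, 1),
    ("Move old nodes to new cluster", 1, 1),
]


def total_weeks(tasks):
    """Return (low, high) totals in weeks for a list of task estimates."""
    low = sum(t[1] for t in tasks)
    high = sum(t[2] for t in tasks)
    return low, high


low, high = total_weeks(TASKS)
print(f"Estimated elapsed time: {low} to {high} weeks")
```

Under those assumptions the total comes to roughly three months from the first meeting to the old nodes being moved over.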
Other provisioning models and related ideas
- We can walk through rates and scenarios, as appropriate.
- We can meet with CAC since they may be willing to do more with a commitment of $25K than is published with their $400 min. offering.
- Brainstorming idea: Would they be willing to add hardware to CAC's RedCloud to get a buyer of that hardware a better cost and/or privileged access?
Buy cycles, on demand
Good for irregular high-performance demands, especially if there are high peaks of need and long-lasting jobs.
- Buy cycles from CAC (RedCloud, minimum of $400 for 8585 core*hours).
- http://www.cac.cornell.edu/RedCloud/start.aspx
- 12 cores available at any one time on one system.
- Can access more than one system at a time, but systems are not linked.
- $400 (minimum) buys you 8585 core*hours
- This comes out to ~1 core for an entire year, non-stop.
- For 96 cores, that's $38.4K for 1 year, non-stop. (They have a max of 96 cores <http://www.it.cornell.edu/about/projects/virtual/test.cfm>.)
- 96 = 8 nodes, each with dual 6-core procs => 8 * 12 = 96
- Or, for $25K, that's ~536,562 core*hours.
- $25K = $400*62.5 units. And each unit is 8585 core*hours, so 62.5 of them gets you 536,562.5 core*hours.
- That comes to ~178,854 core*hours/yr for 3 years, which is a ~20.8-core system running non-stop each year if one 8585 core*hour unit is counted as a full core-year (closer to 20.4 cores using 8,760 hours per year). (Compare to one hardware node, which has 12 cores.)
- CNF, w/ Derek Stuart.
- A very reliable cluster, per Roald.
- Determine costs, processes, and trade-offs of using another cloud service, such as:
- Amazon. Amazon EC2?
- Google. Google Compute?
- Microsoft. Microsoft Azure?
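The RedCloud arithmetic in the list above can be re-run with a short script if the rates or the budget change. This is a minimal sketch using only the figures quoted in these notes ($400 per 8585 core*hours, a $25K budget, 3 years). The function names are made up for this illustration, and the 8,760 hours per year assumes a non-leap year.

```python
# Figures quoted in the notes above.
UNIT_PRICE_USD = 400       # minimum RedCloud purchase
UNIT_CORE_HOURS = 8585     # core*hours bought per $400 unit
HOURS_PER_YEAR = 24 * 365  # 8760; assumption: non-leap year


def core_hours_for(budget_usd):
    """Core*hours that a budget buys at the quoted unit rate."""
    return budget_usd / UNIT_PRICE_USD * UNIT_CORE_HOURS


def sustained_cores(core_hours, years):
    """Number of cores that could run non-stop for the given number of years."""
    return core_hours / (years * HOURS_PER_YEAR)


total = core_hours_for(25_000)
per_year = total / 3
cores = sustained_cores(total, 3)

print(f"Total core*hours for $25K: {total:,.1f}")
print(f"Core*hours per year over 3 years: {per_year:,.0f}")
print(f"Equivalent cores running non-stop: {cores:.1f}")
```

Using the exact 8,760 hours per year gives about 20.4 sustained cores, slightly below the ~20.8 obtained by counting each $400 unit as one full core-year. Either way, the result is a bit less than two 12-core hardware nodes.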
Host hardware at CAC rather than with ChemIT
Hosting at CAC covers the basics: expert initial configuration, keeping the system current, and keeping the lights on. Other services are charged hourly.
Per CAC's rate calculator, the rate for 9 nodes (1 head node + 8 compute nodes) would be $8,291/yr, or $24,873 for 3 years of this service.
At current ChemIT rates, 9 nodes would be $321.84/yr. Or, $965.52 for 3 years of service.
- ChemIT rates are set by the CCB Computing Cmt and may change at any time. The rate for a group's single system (in a cluster or not) is $2.98/month, or $35.76/yr.
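The hosting comparison above can be reproduced with a few lines of code, which is handy if the ChemIT rate changes. This is a sketch only: it uses the rates quoted in these notes, the variable names are made up for this illustration, and `Decimal` is used so that the dollar amounts do not pick up floating-point rounding noise.

```python
from decimal import Decimal

# Rates quoted in the notes above; both are subject to change.
NODES = 9                                    # 1 head node + 8 compute nodes
YEARS = 3
CAC_PER_YEAR = Decimal("8291")               # CAC hosting quote for 9 nodes
CHEMIT_PER_SYSTEM_MONTH = Decimal("2.98")    # ChemIT rate per system

chemit_per_system_year = CHEMIT_PER_SYSTEM_MONTH * 12
chemit_per_year = chemit_per_system_year * NODES
chemit_total = chemit_per_year * YEARS
cac_total = CAC_PER_YEAR * YEARS

print(f"ChemIT: ${chemit_per_year}/yr, ${chemit_total} over {YEARS} years")
print(f"CAC:    ${CAC_PER_YEAR}/yr, ${cac_total} over {YEARS} years")
print(f"Difference over {YEARS} years: ${cac_total - chemit_total}")
```

Note that these figures cover hosting fees only. The $25K hardware purchase applies to both options and is not included.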
Table, related to our options
Option ==> | ChemIT | CAC: buy cycles (RedCloud) | CAC: host hardware | Amazon (EC2?) or other cloud | Other ideas? |
---|---|---|---|---|---|
Hardware costs | $25K | - | $25K | - | |
Hardware support | Yes. | - | Yes. | - | |
OS install and configuration | Yes. CentOS 6.4 | | Yes. CentOS 6.4 | | |
Cluster and queuing management | Yes. Warewulf, with options | - | Yes. ROCKS, no options. | - | |
Research software install and configuration | Yes | No | Yes; additional cost | No | |
Application debugging and optimization support | Not usually. | Yes; additional cost | Yes; additional cost | No. | |