NSF grant awarded. Thus, this project is a "go" as of August 2013.
Summary information for Hoffmann researchers, as of 12/11/13:
Helios and Sol will be unavailable to researchers between Monday, Jan. 6th at 9am and approximately Thursday, Jan. 9th at noon.
- Before the shutdown, all data of value must be copied off of Sol.
- All user directory data will be removed.
- Before the shutdown, all jobs on both Sol and Helios must be stopped.
- Any remaining jobs will be stopped by ChemIT staff.
- Only Sol will be up after this work is done. (Helios will no longer be available.)
- Data will be moved from Helios to Sol.
- And Helios's nodes will be added to Sol.
Detailed information for cluster migration:
1) Reduce researchers' temporary test data on Sol to a bare minimum during the break.
- After the break, ChemIT will delete the contents of everyone's test home directories on Sol. (Those directories were originally copied there as a testing convenience during our test phase.)
- We do this deletion before the final copy to ensure a clean import of researchers' current production home directories from Helios.
- Lulu advised researchers of this step before the Thanksgiving break, so hopefully no surprises.
- During the winter break, Sol's data will be backed up.
- As with Helios, Sol's data will be copied and versioned to a dedicated, internal hard drive. (An illustrative sketch of one such versioned-copy approach appears after this list.)
- NEW: EZ-Backup will be enabled on Sol, providing off-site backup protection.
Bottom line: If there are any files on Sol which researchers cannot afford to lose, researchers must move that data to Helios (or elsewhere) before Sol is turned off.
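These notes don't specify what tooling ChemIT uses for the versioned copies mentioned above. Purely as an illustration of one common approach, here is a minimal Python sketch of rsync snapshots with hard links to a dedicated internal drive. The paths and schedule are hypothetical, and this is not necessarily how Sol's backups are actually implemented.

```python
# Illustrative only: dated, versioned snapshots of home directories to a dedicated
# internal drive, using rsync with hard links. Paths are hypothetical.
import subprocess
from datetime import datetime
from pathlib import Path

SOURCE = "/home/"                    # hypothetical: researchers' home directories
BACKUP_ROOT = Path("/backup/home")   # hypothetical: mount point of the internal drive

def snapshot():
    BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
    dest = BACKUP_ROOT / datetime.now().strftime("%Y-%m-%d")
    latest = BACKUP_ROOT / "latest"
    cmd = ["rsync", "-a", "--delete"]
    if latest.exists():
        # Unchanged files become hard links to the previous snapshot, so each
        # dated directory is a full, browsable copy without duplicating data.
        cmd.append(f"--link-dest={latest}")
    cmd += [SOURCE, str(dest)]
    subprocess.run(cmd, check=True)
    if latest.is_symlink():
        latest.unlink()
    latest.symlink_to(dest)          # point "latest" at the snapshot just taken

if __name__ == "__main__":
    snapshot()
```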
2) Monday, Jan. 6th at 9am: Helios AND Sol will be turned off to researchers
Researchers must know that:
- ALL jobs must cease by then. Any remaining jobs will be stopped by ChemIT.
- Also, there must be no remaining research data on Sol.
Tasks for ChemIT to do:
- Copy researchers' production home directory data from Helios to Sol.
- Move Helios's compute nodes into Sol's rack and reconnect all networking and power.
- This allows us to add the older compute nodes from Helios, making Sol more powerful.
- Testing by ChemIT.
- Coordinate with Huayun (Deng) for initial testing, perhaps sometime Wednesday.
3) Thursday, Jan. 9th, at noon: Sol is expected to be available to researchers.
- At this point, Sol will be up in full production and open to all researchers, with some or all of Helios's compute nodes attached.
The Hoffmann group is encouraged to improve its software installations to create a more robust environment and better support outcomes.
Tasks for Hoffmann group:
- Confirm that Sol is "good". That is, confirm there is no need to "roll back" to using Helios's head node and start over.
- Hoffmann group members are encouraged to install their shared group software in the /home/hoffmann/bin/ directory, including:
- Materials Studio software. (Who? If Prasad is not available to do this, let's discuss your options - thank you.)
- vasp
- gulp
- etc.
Tasks for ChemIT:
- Once Hoffmann researchers confirm Sol is "good", Helios's head node can be added as a compute node to Sol.
NOTE: Helios will no longer be available after all this is done.
- Its data will have been moved to Sol.
- Its compute nodes will have been added to Sol.
Thank you -ChemIT
Older notes:
Next steps
- Meet to review all options and confirm desired direction and expected timing.
- Review resources. Huayun Deng has cluster management experience, including set-up.
Draft idea
- Create a stand-alone cluster using new hardware ($25K for a minimum of 3 years of operations (to confirm!); thus, ~$8K/yr in hardware).
- Uses new OS and related cluster management software.
- Install and configure necessary applications.
- Enable NetID-based access, if possible (limit 2-3 days for a "go/no-go" decision on this functionality)
- Confirm old nodes can successfully be added to that new cluster.
- Migrate users and data to new cluster.
- Migrate old nodes to new cluster.
Unknowns
- Time to install all necessary applications, many of which are new to Lulu, then configure, verify, and debug new-installation-related issues.
- Whether NetID-based access will succeed. Note that this is not a do-or-die step, so we will limit the duration of our investigation, with the hope that we can still make this happen.
Tasks and estimated timing
Top Level Task Description | Effort Est. | Assignee
---|---|---
Planning | |
Discovery / Overview mtg | 1.5 hrs |
Vet options and conduct needs analysis to match to hardware order | 1-2 weeks |
Specify exactly the systems to order within budget. Includes iterating with vendor experts. | 1 week |
Approval | 0 days |
Order & Installation | |
Place & process order | 1/2 week |
Delivery, after order is placed at Cornell | ~3 weeks |
Receive order and set up hardware in 248 Baker Lab | 1 week |
Build New Cluster | |
Get head node and 1st cluster node operational with OS and cluster management software | 3 weeks |
Test / Verify / Approval | 1 week |
Convert Old Cluster | |
Move user accounts and data; test, prep, and do | 1 week |
Move old nodes to new cluster | 1 week |
- Lulu becomes available ~mid-September or early Oct, as of 8/21/13.
- See the unknowns above, which relate to tasks that will obviously take additional time to accomplish.
Other provisioning models and related ideas
- We can walk through rates and scenarios, as appropriate.
- We can meet with CAC since they may be willing to do more with a commitment of $25K than is published with their $400 min. offering.
- Brainstorming idea: Would they be willing to add hardware to CAC's RedCloud to get a buyer of that hardware a better cost and/or privileged access?
Buy cycles, on demand
Good for irregular high-performance demands, especially if there are high peaks of need and long-lasting jobs.
- Buy cycles from CAC (RedCloud); minimum of $400 for 8585 core*hours.
- http://www.cac.cornell.edu/RedCloud/start.aspx
- 12 cores available at any one time on one system.
- Can access more than one system at a time, but systems are not linked.
- $400 (minimum) buys you 8585 core*hours
- This comes out to ~1 core for an entire year, non-stop.
- For 96 cores, that's $38.4K for 1 year, non-stop. (They have a max of 96 cores <http://www.it.cornell.edu/about/projects/virtual/test.cfm>.)
- 96 = 8 nodes, each with dual 6-core procs => 8 * 12 = 96
- Or, for $25K, that's ~536,562 core*hours.
- $25K = $400 * 62.5 units, and each unit is 8585 core*hours, so 62.5 units gets you 536,562.5 core*hours.
- That comes to ~178,854 core*hours/yr over 3 years, which is roughly a 20.8-core system running non-stop each year. (Compare to one hardware node, which has 12 cores.) The arithmetic is worked through in the sketch after this list.
- CNF, w/ Derek Stuart.
- A very reliable cluster, per Roald.
- Determine costs, processes, and trade-offs if use another cloud service, such as:
- Amazon. Amazon EC2?
- Google. Google Compute?
- Microsoft. Microsoft Azure?
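A small Python sketch of the RedCloud arithmetic above. It reuses only the figures quoted in these notes ($400 per unit of 8585 core*hours, the 96-core cap, a 3-year horizon), and it follows the notes in treating one 8585 core*hour unit as roughly one core running non-stop for a year.

```python
# Rough cost arithmetic for CAC RedCloud cycles, using only the figures quoted above.
UNIT_COST_DOLLARS = 400    # minimum purchase: one RedCloud unit
UNIT_CORE_HOURS = 8585     # core*hours per unit (~1 core running non-stop for a year)

def core_hours_for(budget_dollars):
    """Core*hours purchased for a given budget, at the quoted unit rate."""
    return budget_dollars / UNIT_COST_DOLLARS * UNIT_CORE_HOURS

# 96 cores running non-stop for a year, priced in $400 "core-year" units:
print(96 * UNIT_COST_DOLLARS)                       # 38400 -> the ~$38.4K figure

# A $25K budget spread over 3 years:
total = core_hours_for(25_000)                      # 536562.5 core*hours
per_year = total / 3                                # ~178,854 core*hours/yr
print(total, per_year, per_year / UNIT_CORE_HOURS)  # ~20.8 "cores" running non-stop
```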
Host hardware at CAC rather than with ChemIT
Hosting costs at CAC cover the basics: expert initial configuration, keeping the system current, and keeping the lights on. Other services are charged hourly.
Per CAC's rate calculator, the rate for 9 nodes (1 head node + 8 compute nodes) would be $8,291/yr, or $24,873 for 3 years of this service.
At current ChemIT rates, 9 nodes would be $321.84/yr, or $965.52 for 3 years of service. (The arithmetic behind both figures is worked out in the sketch below.)
- ChemIT rates are set by the CCB Computing Committee and may change at any time. The rate for a group's single system (in a cluster or not) is $2.98/month, or $35.76/yr.
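For reference, here is a minimal Python sketch of the arithmetic behind the two 3-year figures above; all rates are the ones quoted in these notes, and nothing else is assumed.

```python
# 3-year hosting cost comparison for 9 systems (1 head node + 8 compute nodes),
# using only the rates quoted above.
SYSTEMS = 9
YEARS = 3

# CAC hosting, per the CAC rate calculator figure above:
cac_per_year = 8291
print(cac_per_year * YEARS)                      # 24873 -> $24,873 for 3 years

# ChemIT, at $2.98 per system per month (rate set by the CCB Computing Committee):
chemit_per_year = 2.98 * 12 * SYSTEMS            # ~$321.84/yr
print(round(chemit_per_year, 2), round(chemit_per_year * YEARS, 2))  # 321.84, 965.52
```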
Table, related to our options
Option ==> | ChemIT | CAC: buy cycles (RedCloud) | CAC: host group's hardware | Amazon (EC2?) or other cloud | Other ideas?
---|---|---|---|---|---
Hardware costs | $25K | - | $25K | - |
Hardware support | Yes | - | Yes | - |
OS install and configuration | Yes. CentOS 6.4 | | Yes. CentOS 6.4 | |
Cluster and queuing management | Yes. Warewulf, with options | - | Yes. ROCKS, no options | - |
Research software install and configuration | Yes | No | Yes; additional cost | No |
Application debugging and optimization support | Not usually | Yes; additional cost | Yes; additional cost | No |