Upgrade to CentOS and add 2 nodes to existing Cluster
Upgrade Description:
Goal: Upgrade and expand the existing Collum High Performance Computing (HPC) cluster.
The existing cluster consists of a head node and 6 slave compute nodes. The desire is to purchase 2 new nodes and expand the cluster to 8 compute nodes.
Cluster software currently in use includes:
- Fedora 11 - Operating System (2009 release; the current version is Fedora 19)
- Perceus - a slave node provisioning and management system (Note: Perceus is now obsolete; development has ended.)
- Torque - a distributed resource manager / queuing system, providing control over batch jobs and distributed compute nodes
- Maui Cluster Scheduler - job scheduler for use on clusters
- WebMO - a Web-based interface to computational chemistry packages
Notes:
- Discovery shows that the new nodes require an updated operating system version, as they are not supported by Fedora 11 (they will not boot on the current OS configuration).
- ChemIT is now using CentOS (6.4) for new OS installations instead of Fedora
- Warewulf has superseded Perceus for cluster node provisioning and management, and is ChemIT's current provisioning package
- Torque and Maui are still the preferred manager and scheduler.
To upgrade to a current OS and provide a smooth transition, the proposed upgrade sequence is as follows:
- Build a new cluster with current software, using one of the new nodes as a head node. Once this is working, transition the existing cluster hardware to the new cluster.
- Install HPC cluster software: CentOS 6.4, Warewulf, Torque, Maui, and WebMO
- Install applications
- Add the 2nd new node as a slave, creating a functioning cluster
- Test and verify
- Move accounts, data, and computing nodes from old cluster to the new cluster.
This will result in a fully upgraded cluster, using the current HPC tools, with a newer head node (under warranty).
Plan
See work estimates below for detailed steps.
Overview plan:
- Pull an old Compute Node (CN) and convert to a new CentOS Head Node (“new HN”)
- Add new CNs to new HN
- Add one old CN to new HN
- After testing, shift production to new HN.
- Add the rest of the old CNs to new HN
- Convert old HN to CN and add it to new HN
- Later (see P41: Migrate to ChemIT Community Head Node):
  - Add all CNs to ChemIT Community HN
  - Convert Collum HN to CN and add to ChemIT Community HN
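The overview plan above (through converting the old HN) can be checked with simple node accounting. The following is a minimal sketch, not part of the original plan: the dictionaries and the `move` helper are hypothetical names, and it only counts machines using the figures given in the description (1 head node, 6 compute nodes, 2 new nodes).

```python
# Hardware counts stated in the description: 1 old head node, 6 old compute
# nodes, and 2 newly purchased nodes. All names here are illustrative.
old_cluster = {"head": 1, "compute": 6}
new_nodes = 2
new_cluster = {"head": 0, "compute": 0}


def move(src_role, dst_role, count=1):
    """Move `count` machines from the old cluster into the new cluster."""
    if old_cluster[src_role] < count:
        raise ValueError(f"not enough old {src_role} nodes to move")
    old_cluster[src_role] -= count
    new_cluster[dst_role] += count


# 1. Pull an old CN and convert it to the new CentOS head node.
move("compute", "head")
# 2. Add the new CNs to the new HN.
new_cluster["compute"] += new_nodes
# 3. Add one old CN to the new HN.
move("compute", "compute")
# 4. Shift production to the new HN (no hardware moves).
# 5. Add the rest of the old CNs to the new HN.
move("compute", "compute", old_cluster["compute"])
# 6. Convert the old HN to a CN and add it to the new HN.
move("head", "compute")

print("new cluster:", new_cluster)
print("old cluster:", old_cluster)
```

Under these assumptions the new cluster ends with one head node and eight compute nodes, which matches the stated goal of expanding to 8 compute nodes.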
Risks
Risks and possible ways to address them:
Gaussian may need to be recompiled under the new OS.
- If so, this adds a lot of time and uncertainty: it spins up a whole new, large project and requires working with the PGI compiler.
WebMO may need to be upgraded from the 2010 version. This is perhaps a good idea anyway, if "good reasons" are identified.
- If so, $1,000, plus the time and process to order and receive the software, learn the differences, and apply it.
Node 2 turns out to be a dud (fails or is unstable).
- If so, use another node (for example, 3).
Time and labor estimates
View Project timeline: Collum Cluster timeline.pdf
Est. labor: ~286 hours
Duration - Start-to-finish time (taking into account availability): 30 work days (start 7/31, Wed)
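As a rough check on the 30-work-day duration, the finish date can be estimated by counting weekdays from the start date. This is only a sketch under stated assumptions: the year 2013 is not given in the plan (it is inferred from Fedora 19 being current and 7/31 falling on a Wednesday), `nth_work_day` is a hypothetical helper, and holidays and staff availability are not modelled.

```python
from datetime import date, timedelta


def nth_work_day(start, n):
    """Return the date of the n-th work day, counting `start` as day 1.

    Saturdays and Sundays are skipped; holidays are NOT modelled.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if start.weekday() >= 5:
        raise ValueError("start must fall on a weekday")
    day = start
    counted = 1
    while counted < n:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            counted += 1
    return day


# ASSUMPTION: the year is 2013; the plan only says "7/31, Wed".
start = date(2013, 7, 31)
if start.weekday() != 2:  # 2 = Wednesday
    raise ValueError("assumed start date is not a Wednesday")

finish = nth_work_day(start, 30)
print("Estimated last work day:", finish.isoformat())
```

Any holiday or unavailable day inside the window would push the estimate later by one work day each.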
Work description | Effort | Elapsed time | Est. date
---|---|---|---
Install and config CentOS on head node | 3 days | 1 week |
Install and config Warewulf | 5 days | 1 week |
Install and config Torque / Maui | 2 days | 0.5 week |
Install and config WebMO | 5 days | 1 week |
Install and config 2nd (new) node | 1 day | 0.2 week |
Copy Jun data for test | 2 days | 0.5 week |
Jun (and group) test & tweak | 2 days | 1 week |
Cleanup / additional | 2 days | 0.5 week |
Move old cluster nodes to new cluster | 1 day | 0.2 week |
Move old cluster user data to new cluster | 2 - 3 days | 0.5 week |
Total | 25 days | 6.4 weeks |
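The totals in the table above can be cross-checked with a short script. This is a sketch only: the row data is copied from the table, the "2 - 3 days" entry is kept as a low/high range, and the 8-hour work day used for the hours conversion is an assumption that the plan does not state.

```python
# (task, effort_low_days, effort_high_days, elapsed_weeks), copied from the table.
rows = [
    ("Install and config CentOS on head node", 3, 3, 1.0),
    ("Install and config Warewulf", 5, 5, 1.0),
    ("Install and config Torque / Maui", 2, 2, 0.5),
    ("Install and config WebMO", 5, 5, 1.0),
    ("Install and config 2nd (new) node", 1, 1, 0.2),
    ("Copy Jun data for test", 2, 2, 0.5),
    ("Jun (and group) test & tweak", 2, 2, 1.0),
    ("Cleanup / additional", 2, 2, 0.5),
    ("Move old cluster nodes to new cluster", 1, 1, 0.2),
    ("Move old cluster user data to new cluster", 2, 3, 0.5),
]

effort_low = sum(low for _, low, _, _ in rows)
effort_high = sum(high for _, _, high, _ in rows)
# Round to one decimal to avoid floating-point noise in the sum.
elapsed_weeks = round(sum(weeks for _, _, _, weeks in rows), 1)

HOURS_PER_DAY = 8  # ASSUMPTION: not stated in the plan

print(f"Effort: {effort_low} to {effort_high} days")
print(f"Elapsed: {elapsed_weeks} weeks")
print(f"Effort in hours: {effort_low * HOURS_PER_DAY} to {effort_high * HOURS_PER_DAY}")
```

The low end of the effort range reproduces the table's total, and the elapsed-time sum matches as well. The hours figure depends entirely on the assumed day length, so it is not directly comparable to the separate labor estimate without knowing how that estimate was derived.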