Upgrade to CentOS and add 2 nodes to existing cluster

Cluster Upgrade Description:

Goal: Upgrade and expand the existing Collum High Performance Computing (HPC) cluster.

The existing cluster consists of a head node and 6 slave compute nodes. The desire is to purchase 2 new nodes and expand the cluster to 8 compute nodes.

Cluster software currently in use includes:

Fedora 11 - Operating system (2009 release; the current version is Fedora 19)
Perceus - a slave node provisioning and management system (Note: Perceus is now obsolete; development has ended.)
Torque - a distributed resource manager / queuing system, providing control over batch jobs and distributed compute nodes
Maui Cluster Scheduler - job scheduler for use on clusters
WebMO - a web-based interface to computational chemistry packages

Notes:

Discovery shows that the new nodes require an updated operating system version: they are not supported by Fedora 11 and will not boot on the current OS configuration.
ChemIT now uses CentOS (6.4) instead of Fedora for new OS installations.
Warewulf has superseded Perceus for cluster node provisioning and management, and is ChemIT's current provisioning package.
Torque and Maui are still the preferred resource manager and scheduler.

To upgrade to a current OS while providing a smooth transition, the proposed upgrade sequence is as follows:

Build a new cluster with current software, utilizing one of the new nodes as a head node. Once this is working, transition the existing cluster hardware to the new cluster.
- Install HPC cluster software: CentOS 6.4, Warewulf, Torque, Maui, and WebMO
- Install applications
- Add the 2nd new node as a slave, creating a functioning cluster
- Test and verify
- Move accounts, data, and compute nodes from the old cluster to the new cluster

This will result in a fully upgraded cluster, using the current HPC tools, with a newer head node (under warranty).

Plan

See work estimates below for detailed steps.

Overview plan:

Pull an old Compute Node (CN) and convert to a new CentOS Head Node (“new HN”)
Add new CNs to new HN
Add one old CN to new HN
After testing, shift production to new HN
Add the rest of the old CNs to new HN
Convert old HN to CN and add it to new HN
Later (see P41: Migrate to ChemIT Community Head Node):
- Add all CNs to ChemIT Community HN
- Convert Collum HN to CN and add to ChemIT Community HN
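
As a quick sanity check on the overview plan, the node movements can be tallied in a few lines of Python. This is only a sketch: the node labels are hypothetical (the page names only "Node 2"), and it assumes the pulled old CN is Node 2 as in the work estimates. It walks the first six overview steps and counts what ends up attached to the new HN.

```python
# Hypothetical node labels; only "Node 2" is named on this page.
OLD_HEAD = "old-HN"
OLD_COMPUTE = [f"old-CN{i}" for i in range(1, 7)]  # 6 existing compute nodes
NEW_NODES = ["new-1", "new-2"]                     # 2 newly purchased nodes


def final_layout(old_head, old_compute, new_nodes, pulled="old-CN2"):
    """Walk the overview steps and return (head_node, compute_nodes)."""
    remaining_old = [n for n in old_compute if n != pulled]
    head = pulled                      # step 1: pulled old CN becomes the new HN
    compute = list(new_nodes)          # step 2: add the new CNs
    compute.append(remaining_old[0])   # step 3: add one old CN for testing
    # step 4: testing and production cut-over do not change membership
    compute.extend(remaining_old[1:])  # step 5: add the rest of the old CNs
    compute.append(old_head)           # step 6: old HN is converted to a CN
    return head, compute


head, compute = final_layout(OLD_HEAD, OLD_COMPUTE, NEW_NODES)
print(f"head node: {head}, compute nodes: {len(compute)}")
```

Under these assumptions the tally comes to 8 compute nodes behind the new HN, which matches the stated goal of expanding to 8 compute nodes.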

Risks

Risks, with possible ways to address them:

Gaussian needs to be recompiled (under new OS).

If so, add A LOT of time and uncertainty. This spins up a whole new, large project and means cracking open the PGI compiler.

WebMO needs to be upgraded from the 2010 version. Perhaps a good idea to do anyway, if "good reasons" are identified.

If so, $1,000, plus the time and process to order and obtain the software, learn the differences, and apply it.

Node 2 turns out to be a dud (fails or is unstable).

If so, use another node (for example, 3).

Time and labor estimates

View Project timeline: Collum Cluster timeline.pdf

Est. labor: ~286 hours

Duration - Start-to-finish time (taking into account availability): 30 work days (start 7/31, Wed)
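
For a rough finish date, the 30 work days can be counted forward from the start date. The sketch below is an estimate under loud assumptions: the year is taken to be 2013 (the page gives only "7/31, Wed", and 31 July 2013 was a Wednesday), the start date counts as work day 1, and no holidays or other unavailability are modeled. The function name `last_work_day` is made up for this illustration.

```python
from datetime import date, timedelta


def last_work_day(start, work_days):
    """Return the final work day, counting `start` as day 1 and skipping weekends.

    Holidays are NOT modeled; this is a weekday-only estimate.
    """
    if start.weekday() >= 5:
        raise ValueError("start must fall on a weekday")
    current = start
    counted = 1
    while counted < work_days:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            counted += 1
    return current


start = date(2013, 7, 31)  # assumption: year 2013, inferred from "7/31, Wed"
print(last_work_day(start, 30))
```

With those assumptions the estimate lands on 2013-09-10. Any holiday inside that window would push the date later.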

Work description	Effort	Elapsed time
Install and config CentOS on head node (use Node 2)	3 days	1 week
Install and config Warewulf (includes attaching one new node)	5 days	1 week
Install and config Torque / Maui	2 days	0.5 week
Install and config WebMO	5 days	1 week
Install and config 2nd (new) node and 3rd (old) node	1 day	0.2 week
Copy Jun data for test	2 days	0.5 week
Jun (and group) test & tweak	2 days	1 week
Cleanup / additional	2 days	0.5 week
Move old cluster nodes to new cluster (keep the old head node for 1 month?)	1 day	0.2 week
Move old cluster user data to new cluster (Collum group downtime)	2 - 3 days	0.5 week
Total	25 days	6.4 weeks
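
The totals in the table can be re-derived with a short script, which is handy if individual estimates change. This is a minimal sketch: the row values are transcribed from the table above, the "2 - 3 days" entry is kept as a low/high range, and the `totals` helper is a name invented here.

```python
# (task, effort in days as (low, high), elapsed time in weeks)
rows = [
    ("Install and config CentOS on head node", (3, 3), 1.0),
    ("Install and config Warewulf", (5, 5), 1.0),
    ("Install and config Torque / Maui", (2, 2), 0.5),
    ("Install and config WebMO", (5, 5), 1.0),
    ("Install and config 2nd (new) and 3rd (old) node", (1, 1), 0.2),
    ("Copy Jun data for test", (2, 2), 0.5),
    ("Jun (and group) test & tweak", (2, 2), 1.0),
    ("Cleanup / additional", (2, 2), 0.5),
    ("Move old cluster nodes to new cluster", (1, 1), 0.2),
    ("Move old cluster user data to new cluster", (2, 3), 0.5),
]


def totals(rows):
    """Return (low effort days, high effort days, elapsed weeks)."""
    low = sum(effort[0] for _, effort, _ in rows)
    high = sum(effort[1] for _, effort, _ in rows)
    # Round to one decimal to avoid floating-point noise in the sum of weeks.
    weeks = round(sum(elapsed for _, _, elapsed in rows), 1)
    return low, high, weeks


low, high, weeks = totals(rows)
print(f"effort: {low}-{high} days, elapsed: {weeks} weeks")
```

The low end of the effort range reproduces the 25-day total, and the elapsed column sums to 6.4 weeks.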

P40 - Collum Cluster Upgrade
