Upgrade to CentOS and add 2 nodes to existing

Cluster Upgrade Description:

Goal: Upgrade and expand the existing Collum High Performance Computing (HPC) cluster.

The existing cluster consists of a head node and 6 slave compute nodes. The plan is to purchase 2 new nodes and expand the cluster to 8 compute nodes.

Cluster software currently in use includes:

  • Fedora 11 - Operating System (2009 release; the current version is Fedora 19)
  • Perceus - a slave node provisioning and management system (Note: Perceus is now obsolete; development has ended.)
  • Torque - a distributed resource manager / queuing system, providing control over batch jobs and distributed compute nodes
  • Maui Cluster Scheduler - job scheduler for use on clusters
  • WebMO - a Web-based interface to computational chemistry packages

Notes:

  • Discovery shows that the new nodes require an updated operating system version: they are not supported by Fedora 11 and will not boot on the current OS configuration.
  • ChemIT now uses CentOS (6.4) instead of Fedora for new OS installations.
  • Warewulf has superseded Perceus for cluster node provisioning and management, and is ChemIT's current provisioning package.
  • Torque and Maui are still the preferred manager and scheduler.

To upgrade to a current OS while keeping the transition smooth, the proposed upgrade and sequence are as follows:

  • Build a new cluster with current software, using one of the new nodes as a head node. Once this is working, transition the existing cluster hardware to the new cluster.
    • Install HPC cluster software: CentOS 6.4, Warewulf, Torque, Maui, and WebMO
    • Install applications
    • Add the 2nd new node as a slave, creating a functioning cluster
    • Test and verify
    • Move accounts, data, and computing nodes from the old cluster to the new cluster
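For the test-and-verify step, one possible smoke test is a short Torque job that asks for both compute nodes and reports which hosts the scheduler assigned. The sketch below only builds the job script text. The job name, node count, and walltime are illustrative assumptions, not values from this plan; the `#PBS` directives and the `PBS_JOBID` / `PBS_NODEFILE` variables are standard Torque features.

```python
def build_smoke_test_script(job_name="smoke-test", nodes=2, walltime="00:05:00"):
    """Return the text of a small Torque job script.

    All default values are assumptions chosen for illustration.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",           # job name shown in the queue
        f"#PBS -l nodes={nodes}:ppn=1",  # one processor on each requested node
        f"#PBS -l walltime={walltime}",  # upper bound on run time
        "#PBS -j oe",                    # merge stderr into stdout
        "",
        'echo "Job $PBS_JOBID started on $(hostname)"',
        'echo "Nodes assigned by the scheduler:"',
        'sort -u "$PBS_NODEFILE"',       # unique hosts allocated to the job
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_smoke_test_script(), end="")
```

The printed script can be saved to a file and submitted with `qsub`. If the job's output lists both compute nodes, provisioning, Torque, and Maui are at least working together for a trivial job.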

This will result in a fully upgraded cluster, using the current HPC tools, with a newer head node that is under warranty.

Plan

See work estimates below for detailed steps.

Overview plan:

  • Pull an old Compute Node (CN) and convert to a new CentOS Head Node (“new HN”)
  • Add new CNs to new HN
  • Add one old CN to new HN
  • After testing, shift production to new HN
  • Add the rest of the old CNs to new HN
  • Convert old HN to CN and add it to new HN
  • Later (see P41): migrate to the ChemIT Community Head Node
    • Add all CNs to ChemIT Community HN
    • Convert Collum HN to CN and add to ChemIT Community HN
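As a sanity check on the node counts, the overview plan can be walked through step by step, up to but not including the later P41 migration. This is a minimal sketch that follows the overview list as written. The node labels (`old-cn1`, `new-cn1`, and so on) are hypothetical, and picking `old-cn2` as the new head node is an assumption based on the "Use Node 2" note in the estimates.

```python
def simulate_overview_plan():
    """Walk the overview steps and record the new cluster's CN count after each."""
    # Starting inventory; labels are hypothetical, counts come from the plan.
    old_cns = [f"old-cn{i}" for i in range(1, 7)]  # 6 existing compute nodes
    old_hn = "old-hn"                              # existing head node
    new_cns = ["new-cn1", "new-cn2"]               # 2 purchased nodes
    compute = []                                   # CNs attached to the new HN
    log = []

    new_hn = old_cns.pop(1)                        # pull old-cn2, convert to new HN
    log.append(("convert old CN to new HN", len(compute)))

    compute.extend(new_cns)                        # add new CNs
    log.append(("add new CNs", len(compute)))

    compute.append(old_cns.pop(0))                 # add one old CN
    log.append(("add one old CN", len(compute)))

    log.append(("shift production", len(compute)))  # no hardware change

    compute.extend(old_cns)                        # add the rest of the old CNs
    old_cns.clear()
    log.append(("add remaining old CNs", len(compute)))

    compute.append(old_hn)                         # convert old HN to CN
    log.append(("convert old HN to CN", len(compute)))

    return new_hn, compute, log


if __name__ == "__main__":
    new_hn, compute, log = simulate_overview_plan()
    for step, count in log:
        print(f"{step}: {count} compute nodes")
    print(f"head node: {new_hn}, compute nodes: {len(compute)}")
```

Under these assumptions the sequence ends with one head node and 8 compute nodes (2 new, 5 old, and the converted old head node), which matches the stated goal of 8 compute nodes.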

Risks

Each risk is followed by a possible way to address it.

Gaussian may need to be recompiled under the new OS.

  • If so, this adds a lot of time and uncertainty. It becomes a large new project in its own right and requires working with the PGI compiler.

WebMO may need to be upgraded from the 2010 version. This may be a good idea anyway, if good reasons are identified.

  • If so, the cost is $1,000, plus the time and process to order and obtain the software, learn the differences, and apply it.

Node 2 turns out to be a dud (fails or is unstable).

  • If so, use another node (for example, 3).

Time and labor estimates

View Project timeline: Collum Cluster timeline.pdf

Est. labor: ~286 hours

Duration - Start-to-finish time (taking into account availability): 30 work days (start 7/31, Wed)

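Work description                                        | Effort (hours or days) | Elapsed time (usually days or weeks) | Est. date
--------------------------------------------------------+------------------------+--------------------------------------+----------
Install and config CentOS on head node (use Node 2)     | 3 days                 | 1 week                               |
Install and config Warewulf (includes attaching one     | 5 days                 | 1 week                               |
  new node)                                             |                        |                                      |
Install and config Torque / Maui                        | 2 days                 | 0.5 week                             |
Install and config WebMO                                | 5 days                 | 1 week                               |
Install and config 2nd (new) node and 3rd (old) node    | 1 day                  | 0.2 week                             |
Copy Jun data for test                                  | 2 days                 | 0.5 week                             |
Jun (and group) test & tweak                            | 2 days                 | 1 week                               |
Cleanup / additional                                    | 2 days                 | 0.5 week                             |
Move old cluster nodes to new cluster (keep the old     | 1 day                  | 0.2 week                             |
  head node for 1 month?)                               |                        |                                      |
Move old cluster user data to new cluster (Collum       | 2 - 3 days             | 0.5 week                             |
  group down time)                                      |                        |                                      |
--------------------------------------------------------+------------------------+--------------------------------------+----------
Total                                                   | 25 days                | 6.4 weeks                            |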
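The totals can be rechecked by summing the rows. The sketch below copies the per-task figures from the estimates above and treats the "2 - 3 days" entry as a low/high range. The 8-hour work day used for the hours conversion is an assumption, not something this plan states.

```python
# (task, effort low in days, effort high in days, elapsed time in weeks)
TASKS = [
    ("Install and config CentOS on head node", 3, 3, 1.0),
    ("Install and config Warewulf", 5, 5, 1.0),
    ("Install and config Torque / Maui", 2, 2, 0.5),
    ("Install and config WebMO", 5, 5, 1.0),
    ("Install and config 2nd (new) and 3rd (old) node", 1, 1, 0.2),
    ("Copy Jun data for test", 2, 2, 0.5),
    ("Jun (and group) test & tweak", 2, 2, 1.0),
    ("Cleanup / additional", 2, 2, 0.5),
    ("Move old cluster nodes to new cluster", 1, 1, 0.2),
    ("Move old cluster user data to new cluster", 2, 3, 0.5),
]

HOURS_PER_DAY = 8  # assumption for illustration only


def totals(tasks):
    """Return (low effort days, high effort days, elapsed weeks)."""
    low = sum(t[1] for t in tasks)
    high = sum(t[2] for t in tasks)
    weeks = round(sum(t[3] for t in tasks), 1)  # round away float noise
    return low, high, weeks


if __name__ == "__main__":
    low, high, weeks = totals(TASKS)
    print(f"Effort: {low} to {high} days")
    print(f"Elapsed: {weeks} weeks")
    print(f"Hours at {HOURS_PER_DAY} h/day: {low * HOURS_PER_DAY} to {high * HOURS_PER_DAY}")
```

The sums come to 25 to 26 days of effort and 6.4 weeks elapsed, so the stated 25-day total corresponds to the low end of the final task's range. At the assumed 8 hours per day this is 200 to 208 hours; the plan does not say how the ~286 hour labor figure was derived.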


 

 

 

 
