
Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises during both scheduled maintenance and emergency work.


Scheduled maintenance and upgrade procedures

ChemIT notifies cluster lead that maintenance will occur on a specific upcoming date.

  • What is a long enough lead time for the group?
  • What is a short enough lead time for ChemIT?

Message will state:

  • Date and time of shut-down. Expected duration of shut-down.
    • Most events will occur Mon-Thu, 9am-5pm Eastern time, when staffing and backup folks are available.
  • Purpose summary.
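The preferred window noted above can be checked mechanically before a date is proposed. The following is a minimal sketch, assuming the window means local wall-clock time, Monday through Thursday, with the whole event fitting between 9am and 5pm on a single day. The function name `fits_preferred_window` is hypothetical and not part of any ChemIT tooling.

```python
from datetime import datetime, timedelta


def fits_preferred_window(start: datetime, duration: timedelta) -> bool:
    """Return True if the event starts and ends within Mon-Thu, 9am-5pm, on one day.

    Assumption: `start` is already expressed in local (Eastern) wall-clock time.
    """
    end = start + duration
    same_day = start.date() == end.date()
    is_mon_to_thu = start.weekday() <= 3  # Monday is 0, Thursday is 3
    opens = start.replace(hour=9, minute=0, second=0, microsecond=0)
    closes = start.replace(hour=17, minute=0, second=0, microsecond=0)
    return same_day and is_mon_to_thu and opens <= start and end <= closes
```

For example, a 3-hour event starting at 10am on a Tuesday fits the window, while the same event starting at 3pm does not, because it would end after 5pm.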

Message will be sent to:

  • Whom?

Sample message:

To: ?
Subject: PI's ClusterName: Date/time planned down-time.

-----------------------------------------------

To all users of the PI's ClusterName,

On Date/time, the cluster will be down for planned maintenance for 3 hours.

During this down-time, we intend to:

  • Test new GPU software capabilities.
  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Update the UPS software to address the current software's limitations.
  • Confirm backups and review other system software configurations.

-----------------------------------------------
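If the same message is sent for many clusters, the placeholders in the sample above (cluster name, date/time, duration, task list) could be filled in from a template. This is only a sketch, assuming Python's standard `string.Template`; the field names and the `render_notice` helper are hypothetical, and the recipient line is left out because the recipients are still undecided.

```python
from string import Template

# Wording follows the sample message; the $fields are illustrative placeholders.
MESSAGE = Template(
    "Subject: $cluster: $when planned down-time.\n"
    "\n"
    "To all users of the $cluster,\n"
    "\n"
    "On $when, the cluster will be down for planned maintenance for $hours hours.\n"
    "\n"
    "During this down-time, we intend to:\n"
    "\n"
    "$tasks\n"
)


def render_notice(cluster, when, hours, tasks):
    """Fill the template; `tasks` is a list of one-line task descriptions."""
    bullet_lines = "\n".join(f"  • {task}" for task in tasks)
    return MESSAGE.substitute(
        cluster=cluster, when=when, hours=hours, tasks=bullet_lines
    )
```

Calling `render_notice("PI's ClusterName", "Date/time", 3, ["Update the OS of the storage system."])` returns the message text with one bullet per task, ready to paste into an email.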

Communication timeline

1) ChemIT notifies group rep. of planned date.

2) Group rep. confirms there is no better date (or negotiates a better date with ChemIT staff).

3) Group rep. notifies all users of the cluster, using a message crafted by ChemIT.

  • Or, group rep. requests that ChemIT send the email to all cluster users on their behalf. Message sent to users' <NetID@cornell.edu> address.

4) The work day before the shutdown, ChemIT sends a reminder.

5) When the cluster is shut down, ChemIT sends a statement to that effect.

6) ChemIT sends a status report if the cluster is not up when expected, providing a new time estimate.

7) ChemIT sends a report when the server is again available.
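The reminder in step 4 goes out on the work day before the shutdown, which is not always the calendar day before. A minimal sketch of that date calculation follows, assuming work days are Monday through Friday and ignoring holidays; `previous_work_day` is a hypothetical helper name.

```python
from datetime import date, timedelta


def previous_work_day(shutdown: date) -> date:
    """Return the last Mon-Fri date strictly before `shutdown` (holidays ignored)."""
    day = shutdown - timedelta(days=1)
    while day.weekday() >= 5:  # 5 is Saturday, 6 is Sunday
        day -= timedelta(days=1)
    return day
```

For a shutdown on a Monday, this gives the preceding Friday as the reminder date.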

Ideas

  • Establish a schedule for at least 6 months out. Why? Used by whom? What if things change?


Emergency work procedures

See also

...

Communication timeline

1) Something bad happens, which was not scheduled.

...

4) ChemIT notifies group rep. (and users) of status and prognosis as soon as practicable.

Typical work done during maintenance

Update BIOS (and why it's done...)

Update OS

Update...

Test UPS

Test backups

Test...


A record (and notes) of Emergency Actions

 Loring

Cluster or HPC Cluster name | Event date and action
Abruna |
Ananth |
Collum |
Hoffmann |
Lancaster (w/ Crane) |
Scheraga |
Widom-Loring |
ChemIT (C4)Eldor |