...

Message will be sent to:

  • Whom?

Sample message:

To: ?
Subject: PI's ClusterName: Date/time planned down-time.

-----------------------------------------------

To all users of the PI's ClusterName,

...

During this down-time, we intend to:

  • Test new GPU software capabilities
  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Update the UPS software to address current software's limitations.
  • Confirm backups and review other system software configurations.

-----------------------------------------------

Communication timeline

1) ChemIT notifies group rep. of planned date.

2) Group rep. confirms there is no better date (or negotiates a better date, with ChemIT staff).

3) Group rep. notifies all users of cluster, using message crafted by ChemIT.

  • Or, group rep. requests ChemIT send the email to all cluster users, on their behalf. Message sent to users' <NetID@cornell.edu> address.

4) The work day before the shutdown, ChemIT sends a reminder.

5) When the cluster is shut down, ChemIT sends a statement to that effect.

6) ChemIT sends a status report if the cluster is not up when expected, providing a new time estimate.

7) ChemIT sends a report when the server is again available. 
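Step 4 depends on knowing which date counts as "the work day before" the shutdown. A minimal sketch of that date arithmetic, assuming work days are Monday through Friday and ignoring holidays (the function name `previous_work_day` is hypothetical, not an existing ChemIT tool):

```python
from datetime import date, timedelta


def previous_work_day(shutdown: date) -> date:
    """Return the last Monday-Friday date strictly before `shutdown`.

    Assumption: holidays are not considered; only weekends are skipped.
    """
    day = shutdown - timedelta(days=1)
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day
```

For a shutdown on Monday, `previous_work_day(date(2024, 1, 8))` returns `date(2024, 1, 5)`, the preceding Friday.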

Ideas

  • Establish a schedule for at least 6 months out. Why? Used by whom? What of things changing?

Emergency work procedures

Communication timeline

1) Something bad happens, which was not scheduled.

  • Power outages have been the most frequent recent reason for downtime.
    • Please talk to us to discuss cost/benefit options to reduce power outage impacts, if you haven't already done so. Thank you!

2) ChemIT learns of the emergency situation.

  • This often happens when researchers confirm the cluster is down and then email ChemIT with evidence.
    • ChemIT staff are available M-F, 9am-5pm most days. Weekends, evenings, and holidays

3) ChemIT characterizes the problem and develops an initial prognosis.

4) ChemIT notifies group rep. (users) of status and prognosis as soon as practicable.

Typical work done during maintenance

...