Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises during both scheduled maintenance and emergency work.

Scheduled maintenance and upgrade procedures

Summary

Details

ChemIT notifies the cluster lead that maintenance will occur on a specific upcoming date.

...

Message will be sent to:

  • Whom?

Typical work done during maintenance

(Point to Lulu's current, active checklist! Nov 2015)

...

Test UPS

Test backups

Test...

Sample message:

To: ?
Subject: PI's ClusterName: Date/time of planned downtime.

...

-----------------------------------------------

Communication timeline

1) ChemIT notifies the group rep. of the planned date.

...

7) ChemIT sends a report when the server is again available.

Ideas

  • Establish a schedule at least 6 months out. Why? Used by whom? What if things change?

Emergency work procedures

Communication timeline

1) Something bad and unscheduled happens.

...

4) ChemIT notifies group rep (users) of status and prognosis as soon as practicable.

Notes on Emergency Actions

Cluster or HPC name      Event date and action

Abruna
Ananth
Collum
Hoffmann
Lancaster (w/ Crane)
Scheraga
Widom-Loring
Eldor