
Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises for both scheduled maintenance and emergency work.

Scheduled maintenance and upgrade procedures

Summary

Details

ChemIT notifies the cluster lead that maintenance will occur on a specific upcoming date.

  • What is a long enough lead time for the group?
  • What is a short enough lead time for ChemIT?

Message will state:

  • Date and time of shut-down. Expected duration of shut-down.
    • Most events will occur Mon-Thu, 9am-5pm EST, when staff and backup personnel are available.
  • Purpose summary.
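The preferred window mentioned above (Mon-Thu, 9am-5pm) can be checked mechanically before a date is announced. The following is a minimal sketch, not an existing ChemIT tool: the function name `fits_preferred_window` and the constants are hypothetical, and it assumes the timestamps are already in the cluster's local time zone.

```python
from datetime import datetime, timedelta

# Assumption: the preferred window is Mon-Thu, 09:00-17:00 local time.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
WINDOW_START_HOUR = 9
WINDOW_END_HOUR = 17


def fits_preferred_window(start: datetime, duration: timedelta) -> bool:
    """Return True if the whole shut-down fits inside one Mon-Thu 9am-5pm window."""
    end = start + duration
    # A shut-down that spills into the next day cannot fit a single window.
    if start.date() != end.date():
        return False
    if start.weekday() not in ALLOWED_WEEKDAYS:
        return False
    window_open = start.replace(hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0)
    window_close = start.replace(hour=WINDOW_END_HOUR, minute=0, second=0, microsecond=0)
    return window_open <= start and end <= window_close
```

For example, a 3-hour shut-down starting at 10am on a Monday fits the window, while the same shut-down on a Friday, or one starting at 3pm, does not.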

Message will be sent to:

  • Whom?

Typical work done during maintenance

  • Proactive maintenance should be scheduled approximately quarterly, with no more than 6 months between sessions.
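The scheduling guideline above (roughly quarterly, never more than 6 months apart) could be tracked with a small helper. This is only an illustrative sketch: `maintenance_status` is a hypothetical name, and the day counts used for "quarterly" and "6 months" are assumptions, not ChemIT policy.

```python
from datetime import date

# Assumptions: "quarterly" is treated as about 91 days, "6 months" as about 183 days.
TARGET_INTERVAL_DAYS = 91
MAX_INTERVAL_DAYS = 183


def maintenance_status(last_maintenance: date, today: date) -> str:
    """Classify a cluster as 'ok', 'due', or 'overdue' for proactive maintenance."""
    elapsed = (today - last_maintenance).days
    if elapsed > MAX_INTERVAL_DAYS:
        return "overdue"  # past the 6-month upper limit
    if elapsed >= TARGET_INTERVAL_DAYS:
        return "due"  # past the quarterly target, still within the limit
    return "ok"
```

Running this against each cluster's last maintenance date would show which groups should be contacted next.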

Sample message:

To: ?
Subject: PI's ClusterName: Date/ time planned down-time.

-----------------------------------------------

To all users of the PI's ClusterName,

On Date/ time, the cluster will be down for planned maintenance for 3 hours.

During this down-time, we intend to:

  • Test new GPU software capabilities.
  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Update the UPS software to address current software's limitations.
  • Confirm backups and review other system software configurations.

-----------------------------------------------
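If notices are sent often, the sample message above could be filled in from a few parameters. The sketch below is one possible approach, not an existing ChemIT script: `build_notice` and its parameters are hypothetical, and the recipient list is left out because it is still an open question on this page.

```python
def build_notice(cluster_name: str, when: str, duration_hours: int, tasks: list[str]) -> str:
    """Render the planned down-time notice, following the sample message layout."""
    lines = [
        f"Subject: {cluster_name}: {when} planned down-time.",
        "",
        f"To all users of the {cluster_name},",
        "",
        f"On {when}, the cluster will be down for planned maintenance for {duration_hours} hours.",
        "",
        "During this down-time, we intend to:",
        "",
    ]
    # One bullet per planned task, in the order given.
    lines += [f"  • {task}" for task in tasks]
    return "\n".join(lines)
```

A call such as `build_notice("PI's ClusterName", "Date/ time", 3, ["Update the OS of the storage system."])` yields text in the same shape as the sample, ready to paste into an email.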

 

Emergency work procedures

See also

Communication timeline

1) Something bad and unscheduled happens.

  • Power outages have recently been the most frequent cause of downtime.
    • If you haven't already done so, please talk to us about the costs and benefits of options for reducing the impact of power outages. Thank you!

2) ChemIT learns of the emergency situation.

  • This often happens when researchers confirm that the cluster is down and then email ChemIT with evidence.
    • ChemIT staff are available M-F, 9am-5pm most days. Weekends, evenings, and holidays

3) ChemIT characterizes the problem and develops an initial prognosis.

4) ChemIT notifies the group rep (users) of the status and prognosis as soon as practicable.

A record (and notes) of Emergency Actions

| Cluster or HPC name | Event date and action |
| --- | --- |
| Abruna | |
| Ananth | |
| Collum | |
| Hoffmann | |
| Lancaster (w/ Crane) | |
| Scheraga | |
| Widom-Loring | |
| Eldor | |