Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises during both scheduled maintenance and emergency work.

Scheduled maintenance and upgrade procedures

ChemIT notifies the cluster lead that maintenance will occur on a specific upcoming date.

  • What is a long enough lead time for the group?
  • What is a short enough lead time for ChemIT?

Message will state:

  • Date and time of shut-down. Expected duration of shut-down.
    • Most events will occur Monday through Thursday, 9am-5pm EST, when staff and backup personnel are available.
  • Purpose summary.
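The preferred window in the list above can be checked mechanically before a date is announced. The following is a minimal sketch, not ChemIT's actual tooling: the function name `within_preferred_window` and the constants are hypothetical, and it assumes naive datetimes already expressed in local (Eastern) time and an outage that must fit within a single working day.

```python
from datetime import datetime, timedelta

# Assumed policy values, taken from the "Mon-Thur, 9am-5pm" bullet above.
ALLOWED_WEEKDAYS = {0, 1, 2, 3}  # Monday=0 ... Thursday=3
DAY_START_HOUR = 9
DAY_END_HOUR = 17


def within_preferred_window(start: datetime, duration: timedelta) -> bool:
    """Return True if the whole outage fits in one Mon-Thu, 9am-5pm day."""
    end = start + duration
    if start.weekday() not in ALLOWED_WEEKDAYS:
        return False
    if end.date() != start.date():
        # Outage spills past midnight, so it cannot fit in one working day.
        return False
    day_open = start.replace(hour=DAY_START_HOUR, minute=0, second=0, microsecond=0)
    day_close = start.replace(hour=DAY_END_HOUR, minute=0, second=0, microsecond=0)
    return day_open <= start and end <= day_close
```

For example, a 3-hour outage starting Tuesday at 10am passes, while the same outage starting at 3pm fails because it would end after 5pm.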

Message will be sent to:

  • Whom?

Sample message:

To: ?
Subject: PI's ClusterName: Date/time planned down-time.

To all users of the PI's ClusterName,

On Date/time, the cluster will be down for planned maintenance for 3 hours.

During this down-time, we intend to:

  • Update the OS of the storage system.
  • Update the BIOS of the 4 GPU clusters.
  • Test new GPU software capabilities.
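
The sample message above could be filled in from a small template so that every notice has the same shape. This is only a sketch under stated assumptions: `render_notice` and its parameters are hypothetical names, the recipient list is deliberately left out because it is still undecided, and the wording simply mirrors the sample.

```python
def render_notice(cluster_label, when, duration_hours, tasks):
    """Build the subject and body of a planned down-time notice.

    cluster_label stands in for "PI's ClusterName" and when for "Date/time";
    both are plain strings supplied by the caller.
    """
    subject = f"{cluster_label}: {when} planned down-time."
    lines = [
        f"To all users of the {cluster_label},",
        "",
        f"On {when}, the cluster will be down for planned maintenance "
        f"for {duration_hours} hours.",
        "",
        "During this down-time, we intend to:",
        "",
    ]
    # One bullet per planned task, in the same style as the sample message.
    lines += [f"  • {task}" for task in tasks]
    return subject, "\n".join(lines)


if __name__ == "__main__":
    subject, body = render_notice(
        "Example cluster",          # hypothetical label
        "Tuesday 10am",             # hypothetical date/time text
        3,
        ["Update the OS of the storage system."],
    )
    print(subject)
    print(body)
```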

 

 

Ideas

Establish a schedule for at least 6 months out. Why? Used by whom? What if things change?

Emergency work procedures

ChemIT learns of the emergency situation.

  • The most common emergency is a power outage.

Typical work done during maintenance

Update BIOS (and why it's done...)

Update OS

Update...

Test UPS

Test backups

Test...

 

 

Cluster name | Event date and action
Abruna |
Ananth |
Collum |
Hoffmann |
Lancaster (w/ Crane) |
Loring |
Scheraga |
Widom |
ChemIT (C4) |