Excerpt |
---|
Clusters and other high-performance servers require maintenance. Documented procedures reduce surprises during both scheduled maintenance and emergency work. |
Scheduled maintenance and upgrade procedures
ChemIT notifies the cluster lead that maintenance will occur on a specific upcoming date.
- What is a long enough lead time for the group?
- What is a short enough lead time for ChemIT?
Message will state:
- Date and time of shut-down. Expected duration of shut-down.
- Most events will occur Mon-Thu, 9am-5pm EST, when staff and backup personnel are available.
- Purpose summary.
Message will be sent to:
- Whom?
Sample message:
To: ?
Subject: PI's ClusterName: Date/time planned down-time.
To all users of the PI's ClusterName,
On Date/time, the cluster will be down for 3 hours for planned maintenance.
During this down-time, we intend to:
- Update the OS of the storage system.
- Update the BIOS of the 4 GPU clusters.
- Test new GPU software capabilities.
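
A notice like the sample above could be filled in from a small template so that every announcement carries the same fields. The following is a minimal sketch in Python, not an established ChemIT tool: the function names `in_preferred_window` and `render_notice` are hypothetical, the date format is an arbitrary choice, and the datetimes are assumed to be naive values in the cluster's local (Eastern) time. The "To:" line is left out because the recipient list is still undecided.

```python
from datetime import datetime, time, timedelta


def in_preferred_window(start: datetime, duration: timedelta) -> bool:
    """Return True if the outage fits within Mon-Thu, 9am-5pm on a single day.

    Assumption: `start` is a naive datetime in the cluster's local time.
    """
    end = start + duration
    return (
        start.weekday() <= 3              # Monday=0 ... Thursday=3
        and start.date() == end.date()    # does not run past midnight
        and start.time() >= time(9, 0)
        and end.time() <= time(17, 0)
    )


def render_notice(pi: str, cluster: str, start: datetime,
                  duration: timedelta, tasks: list[str]) -> str:
    """Build the subject line and body of a planned down-time notice."""
    when = start.strftime("%Y-%m-%d %H:%M")   # hypothetical format choice
    hours = duration.total_seconds() / 3600
    lines = [
        f"Subject: {pi}'s {cluster}: {when} planned down-time.",
        "",
        f"To all users of {pi}'s {cluster},",
        "",
        f"On {when}, the cluster will be down for {hours:g} hours "
        "for planned maintenance.",
        "During this down-time, we intend to:",
    ]
    lines += [f"- {task}" for task in tasks]
    return "\n".join(lines)
```

For instance, calling `render_notice` with the three tasks from the sample message and a `timedelta(hours=3)` yields the same text as the sample, and `in_preferred_window` can be checked first to flag a proposed slot that falls outside staffed hours.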
Ideas
Establish a schedule at least 6 months out. Why? Used by whom? What if things change?
Emergency work procedures
ChemIT learns of the emergency situation.
- The most common emergency is a power outage.
Typical work done during maintenance
...
Test UPS
Test backups
Test...
Scheduled maintenance procedures
...